Abstract - The problem of universally predicting an individual
continuous sequence using a deterministic finite-state machine
(FSM) is considered. The empirical mean is used as a reference,
as it is the constant that fits a given sequence with minimal
square error. With this reference, a reasonable performance
measure is the regret, namely the excess square error over the
reference loss, the empirical variance. The paper analyzes the
tradeoff between the number of states of the universal FSM and
the attainable regret. It first studies the case of a small number
of states. A class of machines, denoted Degenerated Tracking
Memory (DTM), is defined, and the optimal machine in this
class is shown to be optimal among all machines for a small
enough number of states. Unfortunately, DTM machines become
suboptimal as the number of available states increases. Next,
the Exponential Decaying Memory (EDM) machine, previously
used for predicting binary sequences, is considered. While this
machine has poorer performance for a small number of states,
it achieves a vanishing regret for a large number of states.
Following that, an asymptotic lower bound of $O(k^{-2/3})$ on the
achievable regret of any $k$-state machine is derived. This bound
is attained asymptotically by the EDM machine. Furthermore,
a new machine, denoted the Enhanced Exponential Decaying
Memory machine, is shown to outperform the EDM machine for
any number of states.

Index Terms — Universal prediction, individual continuous se- 
quences, finite-memory, least-squares. 



I. Introduction 

Consider a continuous-valued individual sequence
$x_1, \ldots, x_n$, where each sample is assumed to be bounded in
the interval $[a, b]$ but otherwise arbitrary, with no underlying
statistics. Suppose that at each time $t$, after observing
$x_1^t = x_1, \ldots, x_t$, a predictor guesses the next outcome $\hat{x}_{t+1}$
and incurs a square error prediction loss $(x_{t+1} - \hat{x}_{t+1})^2$. A
reasonable reference for the predictor is the best constant that
fits the entire sequence with minimal square error. This
constant is the empirical mean $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$, and its square
error is the sequence's empirical variance $\frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2$.
Let $\hat{x}_{U,1}, \ldots, \hat{x}_{U,n}$ denote the predictions of a (universal)
predictor $U$. When the empirical mean is used as a reference,
the excess loss of $U$ over the empirical mean, for an individual
sequence $x_1^n$, is named the regret:

$$R(U, x_1^n) = \frac{1}{n}\sum_{t=1}^{n}(x_t - \hat{x}_{U,t})^2 - \frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2 . \tag{1}$$

In the setting discussed in this paper, the individual setting, 
the performance of U is judged by the incurred regret of the 
worst sequence, i.e., 

$$\max_{x_1^n} R(U, x_1^n) .$$

Thus, the optimal U should attain 

$$\min_{U} \max_{x_1^n} R(U, x_1^n) .$$

When there are no constraints on the universal predictor, this 
optimal U is the Cumulative Moving Average (CMA): 

$$\hat{x}_{t+1} = \left(1 - \frac{1}{t}\right)\hat{x}_t + \frac{1}{t}\,x_t , \tag{2}$$

where the maximal regret tends to zero with the sequence
length $n$ [1], [2]. Note that while the reference, the empirical
mean predictor, is a constant and needs a single state memory, 
the CMA predictor is unconstrained and requires an ever 
growing amount of memory. A natural question arises - what 
happens if the universal predictor is constrained to be a finite 
$k$-state machine? This is the problem considered in this paper.
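The unconstrained baseline can be illustrated with a minimal sketch. It assumes the CMA prediction at time t is the running mean of the samples seen so far, with an arbitrary first guess of 0.5; the function names and the alternating test sequence are illustrative choices, not taken from the paper.

```python
def cma_predictions(samples, first_guess=0.5):
    """Predict each sample by the running mean of all earlier samples."""
    predictions, total = [], 0.0
    for t, x in enumerate(samples):
        # Before any sample is seen, fall back to the (assumed) first guess.
        predictions.append(first_guess if t == 0 else total / t)
        total += x
    return predictions


def regret(samples, predictions):
    """Average square loss minus the empirical variance, as in Eq. (1)."""
    n = len(samples)
    mean = sum(samples) / n
    loss = sum((x - p) ** 2 for x, p in zip(samples, predictions)) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return loss - variance


def alternating(n):
    """The sequence 0, 1, 0, 1, ... of length n."""
    return [float(t % 2) for t in range(n)]


for n in (10, 100, 1000):
    xs = alternating(n)
    print(n, regret(xs, cma_predictions(xs)))
```

On this sequence the printed regret shrinks as n grows, which is the behaviour the text attributes to the CMA. The price is that the running sum and the time index must be stored with unbounded precision, which is exactly what a finite-state machine cannot do.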

Universal estimation and prediction problems where the 
estimator/predictor is a $k$-state machine have been explored
extensively in past years. Cover [3] studied the hypothesis
testing problem where the tester has a finite memory. Hellman
[4] studied the problem of estimating the mean of a Gaussian
(or more generally stochastic) sequence using a finite state ma- 
chine. This problem is closely related to our problem and may 
be considered as a stochastic version of it: if one assumes that 
the data is Gaussian, then predicting it with a minimal mean 
square error essentially boils down to estimating its mean. More
recently, the finite-memory universal prediction problem for 
individual binary sequences with various loss functions was 
explored thoroughly in [5]-[10]. The finite-memory universal
portfolio selection problem (which deals with continuous-valued
sequences but considers a unique loss function) was
also explored recently [11]. Yet, the basic problem of finite-
memory universal prediction of continuous-valued, individual 
sequences with square error loss was left unexplored so far. 
This paper provides a solution for this problem, presenting
universal predictors that attain a vanishing regret when
a large memory is allowed, while also maintaining an optimal
tradeoff between the regret and the number of states used by
the universal predictor.

The outline of the paper is as follows. In Section II we
formulate the discussed problem and present guidelines that
will be used throughout this paper. Section III is devoted
to universal prediction with a small number of states. We
present the class of Degenerated Tracking Memory (DTM)
machines, an algorithm for constructing the optimal DTM
machine and a lower bound on the achievable regret. The
optimal DTM machine is shown to be the optimal solution
among all machines when a small enough number of states
is available. Sections IV, V and VI are devoted to universal
prediction using a large number of states. We start in Section IV
by proposing a known universal machine, the Exponential
Decaying Memory (EDM) machine, and proving asymptotic
lower and upper bounds on its worst regret. In Section V we
present an asymptotic lower bound on the worst regret of any
deterministic $k$-state machine, and in Section VI we present
a new machine, named the Enhanced Exponential Decaying
Memory (E-EDM) machine, that can attain any desired
vanishing regret while outperforming the EDM machine. In
Section VII we summarize the results and discuss further
research.

II. Preliminaries 

We consider universal predictors with continuous-valued
input samples that are assumed to be bounded in the interval
$[a, b]$. Given a sequential predictor, we would like to compare
the square error incurred by its predictions to the loss incurred
by the empirical mean, the best off-line constant predictor.
In other words, the reference class comprises all predictors
that know the entire sequence in advance but can predict
only a single constant value throughout. The best predictor in this
class is the empirical mean, and its induced loss is the
empirical variance.

Definition 1: For a given sequence $\{x_1, \ldots, x_n\}$, the excess
loss of a universal predictor $U$ with predictions $\{\hat{x}_1, \ldots, \hat{x}_n\}$
over the best constant predictor, the empirical mean
$\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$, is termed the regret of the sequence and is
therefore given by

$$R(U, x_1^n) = \frac{1}{n}\sum_{t=1}^{n}(x_t - \hat{x}_t)^2 - \frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2 . \tag{3}$$



We analyze the performance of a universal predictor $U$ by its
worst sequence, i.e., by the sequence that induces maximal
regret

$$R_{\max}(U) = \max_{x_1^n} R(U, x_1^n) , \tag{4}$$

where we shall take the length of the sequence, $n$, to infinity.
The notations $x_1^n$ and $\{x_t\}_{t=1}^{n}$ are used throughout this paper
to denote $\{x_1, \ldots, x_n\}$.

The universal predictors considered in this work are memory
limited. The finite-state machine (FSM) is a commonly used
model for sequential machines with a limited amount of
storage. We focus here on time-invariant FSMs.

Definition 2: A deterministic finite-state machine is defined
by:

• An array of $k$ states, where $\{S_1, \ldots, S_k\}$ denote the values
assigned to the states.

• The transition of the machine between states is defined by a
threshold set $\mathcal{T}_i = \{T_{i,-m_{d,i}-1}, \ldots, T_{i,m_{u,i}}\}$ for
each state $i$, where $m_{u,i}$ and $m_{d,i}$ are the maximum
number of states allowed to be crossed on the way up
and down from state $i$, respectively. Hence, if at
time $t$ the machine is at state $i$ and the input sample $x_t$
satisfies $T_{i,j-1} \le x_t < T_{i,j}$, the machine jumps $j$ states.
Note that the threshold regions are non-intersecting and
their union covers the interval $[a, b]$ (each
input sample is assumed to be bounded in $[a, b]$).
Equivalently, a transition function $\varphi(i, x)$, that is, the next
state given that the current state and input sample are $i$
and $x$, can be defined:

$$\varphi(i, x) = \begin{cases}
i - m_{d,i} & T_{i,-m_{d,i}-1} \le x < T_{i,-m_{d,i}} \\
i - m_{d,i} + 1 & T_{i,-m_{d,i}} \le x < T_{i,-m_{d,i}+1} \\
\quad\vdots & \\
i + m_{u,i} - 1 & T_{i,m_{u,i}-2} \le x < T_{i,m_{u,i}-1} \\
i + m_{u,i} & T_{i,m_{u,i}-1} \le x \le T_{i,m_{u,i}}
\end{cases}$$



An FSM predictor works as follows: suppose that at time $t$ the
machine is at state $i$; then the prediction is $\hat{x}_t = S_i$, the value
assigned to state $i$. On receiving the input sample $x_t$, the
machine jumps to the next state $\varphi(i, x_t)$. The incurred loss
for time $t$ is then $(x_t - \hat{x}_t)^2$.
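The predict-then-jump loop just described can be sketched in a few lines. This is an illustrative model only: states are indexed from 0, each state carries a sorted list of interior thresholds plus one signed jump per region, and the three-state machine at the bottom uses made-up numbers rather than any machine from the paper.

```python
from bisect import bisect_right


class FSMPredictor:
    """Time-invariant FSM predictor in the spirit of Definition 2 (0-based states)."""

    def __init__(self, values, cuts, jumps, start=0):
        # values[i]: prediction emitted while in state i
        # cuts[i]:   sorted interior thresholds splitting the input range for state i
        # jumps[i]:  signed state jump per region, so len(jumps[i]) == len(cuts[i]) + 1
        for c, j in zip(cuts, jumps):
            if len(j) != len(c) + 1:
                raise ValueError("each state needs one jump per threshold region")
        self.values, self.cuts, self.jumps = values, cuts, jumps
        self.state = start

    def predict(self):
        return self.values[self.state]

    def update(self, x):
        # Region r holds the samples with cuts[r-1] <= x < cuts[r].
        region = bisect_right(self.cuts[self.state], x)
        self.state += self.jumps[self.state][region]


def run(machine, samples):
    """Return the prediction made before each sample is revealed."""
    predictions = []
    for x in samples:
        predictions.append(machine.predict())
        machine.update(x)
    return predictions


# Hypothetical three-state machine on [0, 1]; the numbers are illustrative only.
toy = FSMPredictor(
    values=[0.25, 0.5, 0.75],
    cuts=[[0.5], [0.25, 0.75], [0.5]],
    jumps=[[0, +1], [-1, 0, +1], [-1, 0]],
    start=1,
)
predictions = run(toy, [0.9, 0.9, 0.1, 0.1])
print(predictions)
```

Storing a jump per region rather than a target state keeps the sketch close to the threshold-set description, where a sample falling in the j-th region moves the machine j states.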

Throughout this paper we discuss predictors designed for
input samples that are bounded in $[0, 1]$. One can easily
verify that any FSM that achieves a regret smaller than $R$
for any sequence bounded in $[0, 1]$ can be transformed into
an FSM that achieves a regret smaller than $R(b - a)^2$ for
any sequence bounded in $[a, b]$, by applying the following
simple transformation: each state value $S_i$ is transformed into
$a + (b - a)S_i$ and each threshold set $\mathcal{T}_i$ into $a + (b - a)\mathcal{T}_i$.
Thus, all the results presented in this paper can be extended
to the more general case, where each individual sequence is
assumed to be bounded in $[a, b]$.
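The scaling claim can be checked with a short calculation. The primed symbols below are notation introduced here for the transformed sequence and machine; they are not the paper's.

$$x'_t = a + (b-a)\,x_t, \qquad \hat{x}'_t = a + (b-a)\,\hat{x}_t \quad\Longrightarrow\quad \bar{x}' = a + (b-a)\,\bar{x},$$

$$(x'_t - \hat{x}'_t)^2 - (x'_t - \bar{x}')^2 = (b-a)^2\left[(x_t - \hat{x}_t)^2 - (x_t - \bar{x})^2\right].$$

Averaging over $t$ gives a regret of $(b-a)^2 R(U, x_1^n)$ for the transformed pair. Because the thresholds are mapped by the same affine transformation, the transformed machine visits the same states on the transformed sequence, so the correspondence holds sample by sample.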

To conclude this section, we provide the definition of a
minimal circle and a theorem that we will use throughout
this paper. A version of this theorem was first given in [12,
Theorem 6.5]: the worst binary sequence for a given FSM
with respect to (w.r.t.) the log-loss function endlessly rotates
the machine in a minimal circle. Here we rederive the proof
with emphasis on our case: continuous sequences and the
square-loss function.

Definition 3: A circle is a cyclic closed set of $L$
states/predictions $\{\hat{x}_t\}_{t=1}^{L}$ for which there are input samples $\{x_t\}_{t=1}^{L}$
that rotate the machine between these states. A minimal circle
is a circle that does not contain the same state more than once.
An example is depicted in Figure 1.




Fig. 1. Five states minimal circle - arrows represent the jump at each time
t = 1, ..., 5.






Theorem 1: The sequence that induces maximal regret over
a given FSM endlessly rotates the machine in a minimal
circle.

Proof: Let $\{x_t\}_{t=1}^{n}$ be any sequence of samples and
$\{\hat{x}_t\}_{t=1}^{n}$ the induced sequence of states/predictions on a
$k$-state FSM, denoted $U$. Note that $\{\hat{x}_t\}_{t=1}^{n}$ can be broken into
a sequence of minimal circles, denoted $\{c_i\}_{i=1}^{m}$, and a residual
sequence of transient states (whose number is less than $k$).
A simple algorithm that generates this sequence of minimal
circles works as follows: first search for the first minimal
circle in the sequence, that is, the first pair $i$ and $j$ that satisfy
$\hat{x}_i = \hat{x}_{j+1}$ where all $\{\hat{x}_t\}_{t=i}^{j}$ are different. Take out these
states and their corresponding input samples $\{x_t\}_{t=i}^{j}$ to form
the first minimal circle $c_1$. Repeat this procedure to construct
a sequence of minimal circles. Note that at most $k$ samples
are left as a finite residual sequence. Now, denote the length
of the minimal circle $c_i$ by $n_i$ and the states and samples that
form this circle by $\{\hat{x}_{i,t}\}_{t=1}^{n_i}$ and $\{x_{i,t}\}_{t=1}^{n_i}$, respectively. For
now assume that there is no residual sequence; then the regret
of the complete sequence satisfies

$$R(U, x_1^n) = \frac{1}{n}\sum_{t=1}^{n}\left[(x_t - \hat{x}_t)^2 - (x_t - \bar{x})^2\right] \tag{5}$$

$$= \frac{1}{n}\sum_{i=1}^{m}\sum_{t=1}^{n_i}\left[(x_{i,t} - \hat{x}_{i,t})^2 - (x_{i,t} - \bar{x})^2\right] \tag{6}$$

$$\le \frac{1}{n}\sum_{i=1}^{m}\sum_{t=1}^{n_i}\left[(x_{i,t} - \hat{x}_{i,t})^2 - (x_{i,t} - \bar{x}_i)^2\right] , \tag{7}$$

where $\bar{x}_i = \sum_{t=1}^{n_i} x_{i,t}/n_i$ is the empirical mean of minimal
circle $c_i$; the inequality holds because $\bar{x}_i$ minimizes the square
error over the samples of $c_i$. Let the regret of the minimal circle $c_i$ be $R_i$; then
we can write

$$R(U, x_1^n) \le \frac{1}{n}\sum_{i=1}^{m} n_i R_i . \tag{8}$$

Let the minimal circle with the maximal induced regret be $c_j$.
Then this regret satisfies $R_j \ge R(U, x_1^n)$. This is true since
otherwise, that is, if all $R_i$ satisfy $R_i < R(U, x_1^n)$, we get

$$R(U, x_1^n) \le \frac{1}{n}\sum_{i=1}^{m} n_i R_i \tag{9}$$

$$< R(U, x_1^n) , \tag{10}$$

which is clearly a contradiction. Thus, by further noting that for $n \gg k$
the regret induced by the residual sequence is negligible, and that
there is a finite number of minimal circles in a given FSM, the
theorem follows. ■

III. Designing an optimal FSM with a small number of states

In this section we search for the best universal predictor
with a relatively small number of states. We start by presenting
the optimal machines for one, two and three states.
Optimality is in the sense of achieving the lowest maximal regret
using the allowed number of states. We then define in
subsection III-D a new class of machines termed the Degenerated
Tracking Memory (DTM) machines. This class contains the
optimal solutions presented for one, two and three states.
In subsection III-E a schematic algorithm for constructing
the optimal DTM machine is given. A lower bound on the
achievable (maximal) regret of any DTM machine is proven in
subsection III-F. We conclude this section in subsection III-G
by presenting the tradeoff between the number of states and the regret
achieved by the optimal DTM machine. We further discuss
the fact that, up to a certain number of states, this machine
is optimal not only among the class of DTM machines but
among all machines.

A. Single state universal predictor 

The problem of finding the optimal single-state machine has
a trivial solution: by symmetry, the optimal state
is assigned the value $\frac{1}{2}$, and the worst sequence, all 1's
or all 0's, incurs a (maximal) regret of $R = \frac{1}{4}$.

B. Two states universal predictor 




Fig. 2. Two states machine described geometrically over the interval [0, 1]. 

A two-state machine has two possible minimal circles: a
zero-step circle (staying at the same state) and a two-step circle
(toggling between the two states). The lowest maximal regret
is achieved when the (maximal) regrets of both minimal circles
are equal. Thus, let the lowest state be assigned the value
$S_1 = \sqrt{R}$ and a transition threshold $2\sqrt{R}$, and the second
state $S_2 = 1 - \sqrt{R}$ and a transition threshold $1 - 2\sqrt{R}$.
In that case, the regret of the zero-step circles is no more
than $R$. Now, let us analyze the regret induced by a sequence
$x_1, x_2, x_1, x_2, \ldots$ that endlessly rotates the machine in the
two-step minimal circle. Since the regret is convex in the
input samples, maximal regret is attained at the edges of the
transition regions, that is, when $x_1 = 0$ or $x_1 = 1 - 2\sqrt{R}$
induces the down-step and $x_2 = 1$ or $x_2 = 2\sqrt{R}$ induces
the up-step (assuming that the machine starts at the highest
state). Therefore there are four combinations that may bring
the regret of this minimal circle to its maximum. By computing
these regrets one finds that the sequence $0, 1, 0, 1, \ldots$ incurs the
highest regret: $R(U, x_1^n) = R - 2\sqrt{R} + 3/4$. Equating this
regret to $R$ results in $R = (3/8)^2$, and the maximal regret of
both minimal circles is equalized. Therefore the optimal
two-state machine can be summarized as follows:

• State values are:

$$S_1 = \frac{3}{8}, \qquad S_2 = \frac{5}{8} .$$

• The state transition function satisfies

$$\varphi(1, x) = \begin{cases} 1 & \text{if } x < \frac{3}{4} \\ 2 & \text{otherwise} \end{cases} \qquad
\varphi(2, x) = \begin{cases} 1 & \text{if } x < \frac{1}{4} \\ 2 & \text{otherwise} \end{cases}$$

The worst sequence that endlessly rotates the machine in
one of the minimal circles incurs a (maximal) regret of
$R = (3/8)^2 \approx 0.14$. Thus, if the desired regret is smaller than $(3/8)^2$,
we need to design a machine with more than two states.
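The two minimal circles of this machine can be checked numerically. The sketch below assumes the state values 3/8 and 5/8 and the thresholds 3/4 and 1/4 that follow from setting the square root of R to 3/8; exact rational arithmetic is used so that the comparison with 9/64 is not blurred by rounding. The helper names are illustrative.

```python
from fractions import Fraction as F

S = {1: F(3, 8), 2: F(5, 8)}        # state values S_1, S_2
THRESH = {1: F(3, 4), 2: F(1, 4)}   # next state is 1 if x < threshold, else 2


def next_state(state, x):
    return 1 if x < THRESH[state] else 2


def regret(samples, start_state):
    """Average square loss of the machine minus the empirical variance."""
    state, loss = start_state, F(0)
    for x in samples:
        loss += (x - S[state]) ** 2
        state = next_state(state, x)
    n = len(samples)
    mean = sum(samples, F(0)) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return loss / n - variance


toggle = [F(0), F(1)] * 50   # two-step circle, machine started in state 2
stay = [F(0)] * 100          # zero-step circle, machine kept in state 1

print(regret(toggle, 2), regret(stay, 1))
```

Both circles come out at the same regret, which is the equalization argument of the text: lowering one of them by moving a state or threshold would raise the other.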



C. Three states universal predictor

Fig. 3. Three states machine described geometrically over the [0, 1] axis
(legend: lower state, middle state, higher state).



With the same considerations as for the two-state machine,
the lowest state is assigned $S_1 = \sqrt{R}$ and the upper state
$S_3 = 1 - \sqrt{R}$. By symmetry, the middle state
is assigned $S_2 = \frac{1}{2}$. We also note that if a two-state jump
is allowed from the lower state to the upper state, the sequence
$0, 1, 0, 1, \ldots$ toggles the machine between these states. In that
case, as was shown for the two-state machine, the incurred
regret is no less than $(3/8)^2$. Hence, only a single-state jump is
allowed; otherwise the three-state machine has no gain over
the two-state machine. Thus, in the same manner as for the
two-state machine, one finds that the optimal three-state
machine satisfies:

• State values are:

$$S_1 = 0.3285, \qquad S_2 = 0.5000, \qquad S_3 = 0.6715 .$$

• The state transition function satisfies:

$$\varphi(1, x) = \begin{cases} 1 & \text{if } x < 0.6570 \\ 2 & \text{otherwise} \end{cases}$$

$$\varphi(2, x) = \begin{cases} 1 & \text{if } x < 0.1715 \\ 2 & \text{if } 0.1715 \le x \le 0.8285 \\ 3 & \text{otherwise} \end{cases}$$

$$\varphi(3, x) = \begin{cases} 2 & \text{if } x < 0.3430 \\ 3 & \text{otherwise} \end{cases}$$

The worst sequence that endlessly rotates the machine in
one of the minimal circles incurs a (maximal) regret of
$R = 0.1079$.
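The value 0.1079 can be recovered by a short calculation, sketched here under the assumption that the binding minimal circle is the two-step circle between the lower and middle states, driven by $x_1 = 1$ (up-step from the lower state) and $x_2 = \frac{1}{2} - \sqrt{R}$ (the edge of the down-step region of the middle state). Writing $r = \sqrt{R}$, the predictions are $r$ and $\frac{1}{2}$, so the regret of this circle is

$$\frac{1}{2}\left[(1 - r)^2 + r^2\right] - \frac{1}{4}\left(\frac{1}{2} + r\right)^2 = \frac{7}{16} - \frac{5}{4}\,r + \frac{3}{4}\,r^2 .$$

Equating this to $r^2$ gives

$$r^2 + 5r - \frac{7}{4} = 0 \quad\Longrightarrow\quad r = 2\sqrt{2} - \frac{5}{2} \approx 0.3284, \qquad R = r^2 \approx 0.1079 .$$

At this value of $r$ the other three edge combinations of the same circle, $(1, 0)$, $(2r, 0)$ and $(2r, \frac{1}{2} - r)$, give smaller regrets (roughly 0.10, 0.07 and 0.05), which is consistent with the assumption about the binding combination.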

Figure 4 depicts the states and the transition thresholds over
the interval [0,1]. Note the hysteresis characteristics of the 
machine, providing "memory" or "inertia" to the finite-state 
predictor - an extreme input sample is needed for the machine 
to jump from the current state, that is, to change the prediction 
value. 

D. The class of DTM machines 

We now want to find a more general solution for the best
universal predictor with a small number of states. We start
by defining a new class of machines and then provide an
algorithm to construct the optimal machine in this class.
This optimality is in the sense of achieving the lowest maximal
regret using the allowed number of states. The optimality of
our algorithm among the class of DTM machines is proved.
We further show that for a small enough number of
available states, this optimal DTM machine is also optimal
among all machines.

Fig. 4. Optimal three states machine described geometrically over the interval
[0, 1] along with the transition thresholds of the lower state (dashed line),
middle state (dotted line) and upper state (solid line). The X's represent the
value assigned to each state.

Definition 4: The class of all $k$-state Degenerated Tracking
Memory (DTM) machines is of the form:

• An array of $k$ states: $\{S_{k_l}, \ldots, S_1\}$ are the states in the
lower half (in descending order, where $S_1$ is the nearest
state to $\frac{1}{2}$ and $S_i \le \frac{1}{2}$ for all $1 \le i \le k_l$), and $\{\tilde{S}_1, \ldots, \tilde{S}_{k_u}\}$
are the states in the upper half (in ascending order, where
$\tilde{S}_1$ is the nearest state to $\frac{1}{2}$ and $\tilde{S}_i > \frac{1}{2}$ for all $1 \le i \le k_u$),
where $k_l + k_u = k$.

• The maximum down-step in the lower half, i.e., from
states $\{S_{k_l}, \ldots, S_1\}$, is no more than a single-state jump.
The maximum up-step in the upper half, i.e., from states
$\{\tilde{S}_1, \ldots, \tilde{S}_{k_u}\}$, is no more than a single-state jump.

• A transition between the lower and upper halves is
allowed only from and to the nearest states to $\frac{1}{2}$, $S_1$ and
$\tilde{S}_1$ (implying that the maximum up-jump (down-jump)
from $S_1$ ($\tilde{S}_1$) is a single-state jump).

An example of a DTM machine is depicted in Figure 5. Note
that the optimal solutions presented before for one,
two and three states belong to the class of DTM
machines.




Fig. 5. An example of a DTM machine - note that a transition between the
lower and upper halves is allowed only from (and to) $S_1$ and $\tilde{S}_1$. Arrows
represent the maximum up or down jumps from each state.

Thus, two constraints define the class of DTM machines:
no more than a single-state down-step and up-step from
all states in the lower and upper halves, respectively, and a
transition between these halves is allowed only from and to
the nearest states to $\frac{1}{2}$, $S_1$ and $\tilde{S}_1$. These constraints facilitate
the algorithm for constructing the optimal DTM machine.

E. Constructing the optimal DTM machine 

We now present a schematic algorithm for constructing the
optimal DTM machine. Given a desired regret, $R_d$, the task
of finding the optimal DTM machine can be viewed as a
covering problem, that is, assigning the smallest number of
states in the interval $[0, 1]$ while achieving a regret smaller than
$R_d$ for all sequences. We note that in an optimal $k$-state machine,
the upper half of the states is the mirror image of the lower
half. The symmetry property arises from the fact that any
sequence $\{x_1, \ldots, x_n\}$ can be transformed into the symmetric
sequence $\{1 - x_1, \ldots, 1 - x_n\}$. Both sequences induce the same
regret if full symmetry between the lower and upper halves is
applied. Thus, assuming that the lower half is optimal in the sense
of achieving the desired regret with the smallest number of
states, the upper half must be its reflected image to achieve
optimality. Note that this property allows us to design the
optimal DTM machine only for the lower half.

The algorithm we present here recursively finds the optimal
state allocation and transition thresholds. Suppose states
$\{S_{i-1}, \ldots, S_1\}$ in the lower half (in descending order, where
$S_1$ is the nearest state to $\frac{1}{2}$) and their transition threshold sets
$\{\mathcal{T}_{i-1}, \ldots, \mathcal{T}_1\}$ are given and satisfy regret smaller than $R_d$
for all minimal circles between them. Our algorithm generates
the optimal $S_i$, i.e., the optimal allocation for state $i$, and a
threshold set, $\mathcal{T}_i$, satisfying regret smaller than $R_d$ for all
minimal circles starting at that state.

We start by finding $S_1$, the nearest state to $\frac{1}{2}$ in the lower
half, in the optimal DTM machine.

Lemma 1: In the optimal $k$-state DTM machine for a given
desired regret $R_d$, $S_1 = \frac{1}{2}$ if $k$ is odd and

$$S_1 = \max\left\{1 - \sqrt{R_d + \tfrac{1}{4}}\;,\; 2 + \sqrt{R_d} - 2\sqrt{R_d + \sqrt{R_d} + \tfrac{1}{2}}\right\}$$

if $k$ is even.

Proof: By symmetry, $S_1 = \frac{1}{2}$ in the optimal
DTM machine with an odd number of states; otherwise there are
more states in one of the halves and the symmetry property
presented above does not hold. For even $k$, the nearest state
to $\frac{1}{2}$ in the upper half, $\tilde{S}_1$, is the mirror image of $S_1$, hence
$\tilde{S}_1 = 1 - S_1$. By definition, only a single-state up-jump is
allowed from $S_1$ and only a single-state down-jump is allowed
from $\tilde{S}_1$. Thus, the machine can be rotated between these
states, constructing a two-step minimal circle. Denote by
$x_1$ and $x_2$ the samples that induce the up and down jumps,
respectively. These samples must satisfy the transition
thresholds, i.e.,

$$S_1 + \sqrt{R_d} \le x_1 \le 1 , \qquad 0 \le x_2 \le \tilde{S}_1 - \sqrt{R_d} = 1 - S_1 - \sqrt{R_d} . \tag{11}$$

Since the regret is a convex function of the input samples,
the regret of a minimal circle is maximized by samples
at the edges of the constraint regions. Thus, in a two-step
minimal circle there are four combinations that may maximize
the regret and need to be analyzed. By examining the regrets
in all four cases we find that $S_1$ must satisfy two constraints:

$$S_1 \ge 1 - \sqrt{R_d + \tfrac{1}{4}} \quad\text{and}\quad S_1 \ge 2 + \sqrt{R_d} - 2\sqrt{R_d + \sqrt{R_d} + \tfrac{1}{2}} .$$

■

Note that $S_1$ must satisfy $S_1 \le \frac{1}{2}$, which does not hold for
low enough $R_d$, implying a lower bound on the achievable
regret of the optimal DTM machine (see subsection III-F).
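The starting state for even k is easy to evaluate numerically. The sketch below assumes the two constraints of Lemma 1 read as S_1 >= 1 - sqrt(R_d + 1/4) and S_1 >= 2 + sqrt(R_d) - 2 sqrt(R_d + sqrt(R_d) + 1/2); the function name is illustrative.

```python
import math


def starting_state_even(rd):
    """Smallest S_1 allowed by the two constraints of Lemma 1 (even k)."""
    r = math.sqrt(rd)
    from_extreme_samples = 1 - math.sqrt(rd + 0.25)        # x_1 = 1, x_2 = 0
    from_threshold_edge = 2 + r - 2 * math.sqrt(rd + r + 0.5)
    return max(from_extreme_samples, from_threshold_edge)


for rd in (9 / 64, 0.05, 1 / 36, 0.02):
    print(rd, starting_state_even(rd))
```

Two sanity checks follow from the text: at R_d = (3/8)^2 the formula returns 3/8, the lower state of the optimal two-state machine, and at R_d = 1/36 it returns exactly 1/2, the point below which no even-sized DTM machine exists.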

Now, after presenting the starting state of the algorithm, we 
present the complete algorithm for constructing the optimal 
DTM machine: 

1) Set $i = 1$ and the corresponding starting state $S_1$ (see
Lemma 1). Set the maximum up-step from the starting
state $m_{u,1} = 1$.

2) Set the next state index $i = i + 1$.

3) Set the maximal up-step from state $i$ to $m = 1$. Find the
minimal value that can be assigned to that state with a
valid threshold set (in the sequel we present an algorithm for
finding a valid threshold set). Denote this value by $S_{i,m}$
and the threshold set by $\mathcal{T}_{i,m}$. Repeat this procedure
for all $m = 1, \ldots, i - 1$ (a jump of $i - 1$ states from
state $i$ brings the machine to state $S_1$; recall that
a higher jump is not allowed in a DTM machine).

4) Choose the minimal $S_{i,m}$ among all possible maximum
up-steps, that is, set

$$m_{u,i} = \arg\min_{1 \le m \le i-1} S_{i,m} , \qquad S_i = S_{i,m_{u,i}} , \qquad \mathcal{T}_i = \mathcal{T}_{i,m_{u,i}} .$$

Thus we have set the parameters of state $i$: assigned
value $S_i$, maximum up-jump of $m_{u,i}$ states and transition
thresholds $\mathcal{T}_i$.

5) If $S_i > \sqrt{R_d}$ go to step (2).

6) Set the upper half of the states to be the mirror image
of the lower half.

Explanations and Comments: 

• For a given desired regret $R_d$, one should run the algorithm
presented above twice: for an odd and for an even number
of states, each with the corresponding starting state $S_1$. The
optimal DTM machine is the one with fewer states
of the two (they differ by a single state).

• Note that transition thresholds for state 1 need to be
given: a single-state up-jump if the input sample satisfies
$x > S_1 + \sqrt{R_d}$ and a single-state down-jump if the input
sample satisfies $x < S_1 - \sqrt{R_d}$. These are the optimal
transition thresholds since, as the interval for transition
becomes wider, the number of possible worst sequences in
other minimal circles decreases. Furthermore, with these
transition thresholds the maximal regret of a zero-step
minimal circle (staying at $S_1$) is $R_d$.

• A valid threshold set for state $i$ is a set of transition
thresholds that satisfy regret smaller than $R_d$ for all
minimal circles starting at state $i$.

To complete the construction of the optimal DTM machine,
we still need to present an algorithm for finding the optimal
transition thresholds at each iteration (Step (3)). Suppose
states $\{S_{i-1}, \ldots, S_1\}$ in the lower half and their transition
threshold sets $\{\mathcal{T}_{i-1}, \ldots, \mathcal{T}_1\}$ are given and satisfy regret
smaller than $R_d$ for all minimal circles between them. Suppose
also that $S_i$ and $m$ are given, where $m$ denotes the maximum
up-step from state $i$. Note that there are $m + 1$ minimal circles
starting at state $i$ (depicted in Figure 6):

• Zero-step minimal circle (staying at state $i$).

• For any $2 \le j \le m + 1$, a minimal circle of $j$ steps: one
up-step (of $j - 1$ states) and $j - 1$ down-steps (of a single
state).

Also note that these $m + 1$ minimal circles are within the lower
half, that is, within the states $\{S_i, \ldots, S_1\}$ (see Step (3)).



Fig. 6. m + 1 possible minimal circles starting at Si, where m is the 
maximum up-step from state i. 

Let $x_1^j$ be the samples that endlessly rotate the machine
in a minimal circle, where $x_1$ induces the up-step from state
$i$ and $x_2^j$ induce the down-steps. Since the regret is convex
in the input samples, the samples $x_2^j$ that bring the regret to its
maximum are at the edges of the transition regions, that is,
satisfying

$$x_t = \hat{x}_t - \sqrt{R_d} \quad\text{or}\quad x_t = 0 \qquad \forall\; 2 \le t \le j . \tag{12}$$

Thus, there are $2^{j-1}$ combinations of $x_2^j$ that may maximize
the regret. Now, given $x_2^j$, Lemma 2 below provides upper
($C_h(x_2^j)$) and lower ($C_l(x_2^j)$) bounds on $x_1$ so that in this
region the induced regret is smaller than $R_d$. Therefore, by
computing these bounds for all $2^{j-1}$ combinations of $x_2^j$, one
may find a region for $x_1$ in which the regret is lower than $R_d$
for all of these combinations. This region is given by

$$C_l = \max_{x_2^j \in A_j} C_l(x_2^j) \;\le\; x_1 \;\le\; \min_{x_2^j \in A_j} C_h(x_2^j) = C_h , \tag{13}$$

where $A_j$ is the set of $2^{j-1}$ combinations of $x_2^j$. Note that this
interval is valid only if $C_l < C_h$. In that case we can say that
the maximal regret of this minimal circle is guaranteed to be
lower than $R_d$ and conclude that the transition thresholds for
a $(j-1)$-state up-jump from state $i$ must satisfy

$$C_l \le T_{i,j-2} < T_{i,j-1} \le C_h . \tag{14}$$

Going over all minimal circles, $2 \le j \le m + 1$, results in upper
and lower bounds, $C_l$ and $C_h$, for each transition threshold.
Thus, if a threshold set can be found that satisfies all bounds and
covers the interval $[S_i + \sqrt{R_d}, 1]$ (that is, $T_{i,m} \ge 1$ and
$T_{i,0} \le S_i + \sqrt{R_d}$), we say that valid transition thresholds for
state $i$ were found; otherwise there are no valid thresholds
for the given $S_i$ and $m$.

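Lemma 2: Consider a sequence $x_1^j$ that rotates a DTM
machine in a minimal circle starting at state $i$. Given states
$\{S_i, \ldots, S_{i-j+1}\}$, the regret does not exceed $R_d$ if $x_1$ satisfies:

$$a(x_2^j) - b(x_2^j) \le x_1 \le a(x_2^j) + b(x_2^j) ,$$

where:

$$a(x_2^j) = S_i + \sum_{t=2}^{j}(S_i - x_t) ,$$

$$b(x_2^j) = j\sqrt{R_d - \frac{1}{j}\sum_{t=2}^{j}(S_{i-j+t-1} - S_i)(S_{i-j+t-1} + S_i - 2x_t)} . \tag{15}$$

Proof: Analyzing the regret of the sequence and requiring
that it does not exceed $R_d$ results in the constraint on $x_1$:

$$\frac{1}{j}\sum_{t=1}^{j}\left[(x_t - \hat{x}_t)^2 - (x_t - \bar{x})^2\right] \le R_d , \tag{16}$$

where $\hat{x}_1 = S_i$ and $\hat{x}_t = S_{i-j+t-1}$ for $2 \le t \le j$. Expanding (16)
gives a quadratic inequality in $x_1$ whose roots are $a(x_2^j) \pm b(x_2^j)$. ■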

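We can now present the algorithm for finding a threshold
set for state $i$ given $S_i$ and $m$, the maximum up-step:

1) Find $C_{j,l}$ and $C_{j,h}$ for all $2 \le j \le m + 1$ as follows:

$$C_{j,l} = \max_{x_2^j \in A_j}\left[a(x_2^j) - b(x_2^j)\right] , \qquad C_{j,h} = \min_{x_2^j \in A_j}\left[a(x_2^j) + b(x_2^j)\right] , \tag{17}$$

where $a(x_2^j)$ and $b(x_2^j)$ are given in (15) and $A_j$ is the
set of $2^{j-1}$ combinations of $x_2^j$ satisfying

$$x_t = S_{i-j+t-1} - \sqrt{R_d} \quad\text{or}\quad x_t = 0 \qquad \forall\; 2 \le t \le j . \tag{18}$$

2) If one of the following constraints does not hold, return
and declare that there are no valid thresholds:

$$\begin{aligned}
C_{j,l} &< C_{j,h} \qquad \forall\; 2 \le j \le m , \\
C_{j+1,l} &< C_{j,h} \qquad \forall\; 2 \le j \le m , \\
C_{2,l} &\le S_i + \sqrt{R_d} , \\
1 &\le C_{m+1,h} .
\end{aligned} \tag{19}$$

3) Find valid monotonically increasing transition thresholds
$\{T_{i,0}, \ldots, T_{i,m}\}$ that satisfy:

$$\begin{aligned}
C_{j+1,l} &\le T_{i,j-1} \le C_{j,h} \qquad \forall\; 2 \le j \le m , \\
C_{2,l} &\le T_{i,0} \le S_i + \sqrt{R_d} , \\
1 &\le T_{i,m} \le C_{m+1,h} .
\end{aligned} \tag{20}$$

4) Set the transition thresholds for the down-step to
$\{0, S_i - \sqrt{R_d}\}$.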
Explanations and Comments: 

• $C_{j,l} < C_{j,h}$ must be satisfied, otherwise there is no $x_1$ that
satisfies regret smaller than $R_d$ for all $2^{j-1}$ combinations
of $x_2^j$.

• $C_{j+1,l} < C_{j,h}$ must be satisfied, otherwise there is no
$T_{i,j-1}$ satisfying both $T_{i,j-1} \le C_{j,h}$ and $C_{j+1,l} \le T_{i,j-1}$.

• $T_{i,0} \le x_1 < T_{i,1}$ induces a single-state up-jump, hence
$T_{i,0}$ must satisfy $C_{2,l} \le T_{i,0}$. Also, $T_{i,0}$ must satisfy
$T_{i,0} \le S_i + \sqrt{R_d}$ to ensure regret smaller than $R_d$ for the
zero-step minimal circle (staying at state $i$).

• $T_{i,m-1} \le x_1 < T_{i,m}$ induces an $m$-state up-jump, hence
$T_{i,m}$ must satisfy $T_{i,m} \le C_{m+1,h}$. The transition thresholds
must cover the interval $[S_i + \sqrt{R_d}, 1]$, therefore $T_{i,m}$
must also satisfy $1 \le T_{i,m}$.

• This algorithm provides a threshold set given the states
$\{S_{i-1}, \ldots, S_1\}$ and $m$, the maximum up-step from state $i$.
It also requires $S_i$. Recalling the algorithm for finding $S_i$,
we search for the minimal $S_{i,m}$ with a valid threshold
set for a given $m$. Thus, one can start with a high $S_{i,m}$ and
reduce it until no valid threshold set can be found.



lower bounded by 

R = (i)2 ^ 0.0278 . 

Proof: In an optimal fc-states DTM machine, where k is 
even, the starting state 5*1, must satisfies 

Si = max{l-y/i?rf + l,2+^d-2^ Rd + + \} < 5, 

(21) 

implying that if the desired regret satisfies \/Rd < ^, then 
5i > ^ and no DTM machine with even number of states can 
be formed. We then conclude that also a DTM machine with 
odd number of states can not be formed (since otherwise a 
sub-optimal DTM machine with even number of states could 
have been formed by adding another state). ■ 



Theorem 2: The algorithm given in this section constructs 
the optimal DTM machine for a given desired regret, Rd, i.e., 
has the lowest number of states among all DTM machines 
with maximal regret smaller than Rd- 

Proof: In each iteration the algorithm finds the minimal 
Si with a valid threshold set. Note that in DTM machines 
the transition thresholds for up-steps, {T^ 0, T^.m^ do 
not have an impact on regrets of minimal circles other than 
those starting at state i. Thus, given Si, the optimality of these 
thresholds is only in the sense of satisfying regret smaller 
than Rd for these minimal circles. As for the down thresholds 
- an input sample x induces a down-step from state s if 
satisfies < a: < Ts-i- As Ts.-i is smaller for all states 
s = i — 1, 1 the achievable Si with a valid threshold set 
is smaller (the constrains are more relaxed). We choose the 
smallest Ts _i for all states, i.e., Ss^^/Rd- Furthermore, each 
Ss is chosen to be minimal. We further show that optimality 
is achieved when assigning the minimal value for all states. 
Consider , 6*1} in the lower half are the outputs of 

the algorithm For a given desired regret Rd- Let us examine the 
case where the assigned value for state i — 1 is Si-i satisfying 
Si-i > Si^i. We note that the value assigned to state i — 1 has 
no impact on the optimality of states i — 2, 1. Furthermore, 
the constrains on the up thresholds of state i depend only on 
Ss — Si or Sg — Sf, where s = i — 1, 1 (applying xt = 
or Xt — Si-j+t-i — \fRd in Equation ([T5|l). Since 5*^ is the 
minimal value with valid thresholds for {Si-i, Si} , the 
minimal value with valid thresholds for {S'i„i, S'i_2, S*!} 
is not smaller than Si. This holds for all states and 
therefore, choosing Si-i does not reduce the number of states. 

Thus, in all aspects optimality is achieved at each iteration
in the algorithm by assigning state $i$ with the minimal value
$S_i$, down thresholds $\{0, S_i - \sqrt{R_d}\}$ and valid up thresholds.



F. Lower Bound on the Maximal Regret of DTM Machines

Here we show that no DTM machine can attain a
maximal regret lower than $(\frac{1}{6})^2$. The constraints imposed on
this class of machines (as described in Section III-D) yield
this lower bound.

Theorem 3: The maximal regret of any DTM machine is
lower bounded by $(\frac{1}{6})^2 = \frac{1}{36}$.

G. Conclusions

In Figure 7 we present the number of states vs. maximal
regret of the machines constructed by the algorithm presented
above. Note that the optimal machine cannot attain a maximal
regret smaller than 1/36.

[Figure 7: maximal regret vs. number of states; curves: DTM Machine, Lower Bound.]

Fig. 7. Performance of the optimal DTM machine.

In this section we started by presenting the optimal solution
for machines with one, two and three states. These
solutions belong to the class of DTM machines. Furthermore,
one can validate that for these numbers of states our algorithm
generates machines that are identical to these optimal solutions.
Thus, in addition to Theorem 2, we can conclude that up to a
certain number of states, our algorithm generates the optimal
solution among all machines. This number, however, is yet
unknown.

IV. The Exponential Decaying Memory machine 

In the previous section we studied the case of tracking the
empirical mean when a small number of states is available. In
the rest of the paper we shall examine the case of a large number
of states. We start by proposing the Exponential Decaying
Memory (EDM) machine. This machine was presented in [13]
as a universal predictor for individual binary sequences. It
was further shown that with $k$ states it achieves an asymptotic
regret of $O(k^{-2/3})$ compared to the constant predictors class
and w.r.t. the log-loss (code length) and square-error functions.
Here we start by describing and adjusting the EDM machine
for our case, predicting individual continuous sequences.

Definition 5: The Exponential Decaying Memory machine
is defined by:

• $k$ states $\{S_1, \ldots, S_k\}$ distributed uniformly over the
interval $[k^{-1/3}, 1 - k^{-1/3}]$,

• The transition function between states satisfies:

$\hat{x}_{t+1} = Q\left(\hat{x}_t (1 - k^{-2/3}) + x_t k^{-2/3}\right)$ (22)

where $\hat{x}_t$ is the prediction (state) at time $t$ and $Q$ is the
quantization function to the nearest state.

Note that the spacing gap between states, denoted $\Delta$,
satisfies:

$\Delta = \frac{1 - 2k^{-1/3}}{k - 1}$ , (23)

and the quantization function satisfies $Q(y) = \hat{x}_{t+1}$ if $y$
satisfies $\hat{x}_{t+1} - \frac{1}{2}\Delta \le y < \hat{x}_{t+1} + \frac{1}{2}\Delta$. Also note that the EDM
machine is a finite-memory approximation of the Cumulative
Moving Average predictor given in Equation (2), where the
time-varying weight of the new sample
is replaced by the constant value $k^{-2/3}$ (which was shown to
be optimal in [13]).
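As a minimal sketch of Definition 5 (not the authors' code), the following Python simulates the EDM transition rule and measures the regret of a given sequence. The names `make_edm` and `regret`, the choice of starting state, and the reading of the state interval as $[k^{-1/3}, 1 - k^{-1/3}]$ are assumptions made for illustration; the sketch also assumes $k > 8$ so that the interval is non-empty.

```python
import math

def make_edm(k):
    """Build a k-state EDM machine (illustrative reading of Definition 5)."""
    eps = k ** (-2.0 / 3.0)          # constant averaging weight k^(-2/3)
    lo = k ** (-1.0 / 3.0)           # assumed lowest state value
    hi = 1.0 - lo                    # assumed highest state value
    delta = (hi - lo) / (k - 1)      # uniform spacing gap between states
    states = [lo + i * delta for i in range(k)]

    def step(i, x):
        """Index of the next state when the machine is at state i and sees x."""
        y = states[i] * (1.0 - eps) + x * eps
        j = round((y - lo) / delta)  # quantize to the nearest state
        return min(max(j, 0), k - 1)

    return states, step

def regret(states, step, xs, start):
    """Average square-error loss minus the empirical variance of xs."""
    i, loss = start, 0.0
    for x in xs:
        loss += (x - states[i]) ** 2
        i = step(i, x)
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    return loss / n - variance
```

For example, `states, step = make_edm(64)` followed by `regret(states, step, xs, start=31)` returns the regret of the sequence `xs` when the machine starts near the middle of the interval.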

We now present asymptotic bounds on the regret attained by 
the EDM machine when used to predict individual continuous 
sequences. 

Theorem 4: The maximal regret of the $k$-states EDM machine,
denoted $U_{EDM_k}$, attained by the worst continuous
sequence, is bounded by

Proof: Consider a length-$L$ sequence $\{x_t\}_{t=1}^{L}$ that endlessly
rotates the machine in a minimal circle of $L$ states $\{\hat{x}_t\}_{t=1}^{L}$.
The input sample at each time $t$ can be written as follows

$x_t = \hat{x}_t + (P_t \Delta + \delta_t) k^{2/3}$ , (24)

where $P_t \in \mathbb{Z}$ denotes the number of states crossed by the
machine at time $t$, and $\delta_t$ is a quantization addition that satisfies
$|\delta_t| < \frac{1}{2}\Delta$ and has no impact on the jump at time $t$, i.e., has
no impact on the prediction at time $t + 1$. Since we examine
a minimal circle, the sum of states crossed on the way up
is equal to the sum of states crossed on the way down, i.e.,
$\sum_{t=1}^{L} P_t = 0$. This means that the empirical mean of the
sequence is

$\bar{x} = \frac{1}{L} \sum_{t=1}^{L} (\hat{x}_t + \delta_t k^{2/3})$ . (25)

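Now, we can write

$R(U_{EDM_k}, x_1^L) = \frac{1}{L} \sum_{t=1}^{L} \left[ (x_t - \hat{x}_t)^2 - (x_t - \bar{x})^2 \right]$ (26)

$= \bar{x}^2 + \frac{1}{L} \sum_{t=1}^{L} (\hat{x}_t^2 - 2 x_t \hat{x}_t)$ . (27)

By Jensen's inequality we have $\bar{x}^2 \le \frac{1}{L} \sum_{t=1}^{L} (\hat{x}_t + \delta_t k^{2/3})^2$.
Applying this and (24) to Equation (27) yields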

$R(U_{EDM_k}, x_1^L) \le \frac{1}{L} \sum_{t=1}^{L} \delta_t^2 k^{4/3} - \frac{2}{L} \sum_{t=1}^{L} P_t \Delta k^{2/3} \hat{x}_t$ . (28)

The first term on the right hand side depends only on the
quantization of the input samples, $\delta_t$, thus we term it quantization
loss. The second term depends on the spacing gap
between states, $\Delta$, thus we term it spacing loss. Hence, the
regret of the sequence is upper bounded by a loss incurred
by the quantization of the input samples and a loss incurred
by the quantization of the states' values, i.e., the prediction
values. By applying $|\delta_t| < \frac{1}{2}\Delta$ we bound the quantization
loss:

quantization loss $= \frac{1}{L} \sum_{t=1}^{L} \delta_t^2 k^{4/3} < \frac{1}{4} \Delta^2 k^{4/3} \le \frac{1}{4} k^{-2/3}$ , (29)

where the last step uses $\Delta \le k^{-1}$.
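For readers who want to check the step from (27) to (28), the expansion can be written out as follows. This is only a worked restatement that uses (24), (25), (27) and Jensen's inequality, with $\hat{x}_t$ denoting the prediction at time $t$:

```latex
\begin{align*}
\bar{x}^2 + \frac{1}{L}\sum_{t=1}^{L}\left(\hat{x}_t^2 - 2x_t\hat{x}_t\right)
&\le \frac{1}{L}\sum_{t=1}^{L}\left[(\hat{x}_t + \delta_t k^{2/3})^2 + \hat{x}_t^2
   - 2\hat{x}_t\left(\hat{x}_t + (P_t\Delta + \delta_t)k^{2/3}\right)\right] \\
&= \frac{1}{L}\sum_{t=1}^{L}\left[\delta_t^2 k^{4/3} - 2P_t\Delta k^{2/3}\hat{x}_t\right].
\end{align*}
```

The cross terms $2\hat{x}_t \delta_t k^{2/3}$ cancel, which is why only the quantization term and the spacing term remain.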

Let us now upper bound the spacing loss. We define a sub-step
as a single state step that is associated with a full
step. For example, a step at time $t$ of $P_t > 0$ states
consists of $P_t$ sub-steps. We denote these up sub-steps by
$\{USS_{t,1}, \ldots, USS_{t,P_t}\}$. Note that all of them are associated
with a full up-step from state $\hat{x}_t$. Since in a minimal circle the
number of states crossed on the way up and down are equal,
we can divide all sub-steps into pairs of up and down sub-steps
that cross the same state. For example, an up sub-step
$USS_{t,j}$ is paired with a down sub-step that crosses the same
state. The up sub-step is associated with a full up-step from
state $\hat{x}_t$. The paired down sub-step is associated with a full
down-step from a state which we denote by $\hat{x}_{USS_{t,j}}$. Noting
that $P_t$ is positive for up-steps and negative for down-steps,
we can write

$-\frac{1}{L} \sum_{t=1}^{L} P_t \hat{x}_t = -\frac{1}{L} \sum_{t \in \{up\ steps\}} P_t \hat{x}_t - \frac{1}{L} \sum_{t \in \{down\ steps\}} P_t \hat{x}_t$

$= \frac{1}{L} \sum_{t \in \{up\ steps\}} \sum_{j=1}^{P_t} (\hat{x}_{USS_{t,j}} - \hat{x}_t)$ . (30)

Now, up sub-step $USS_{t,j}$ crosses one of the states between
$\hat{x}_t$ and $\hat{x}_t + P_t \Delta$. The paired down sub-step has to cross the
same state. Since the farthest up or down-step in an EDM
machine is $k^{-2/3}$, we can conclude that the paired down sub-step
is associated with a full down-step from a state that satisfies
$\hat{x}_{USS_{t,j}} < \hat{x}_t + P_t \Delta + k^{-2/3}$. By applying this to Equation
(30) we get

$-\frac{1}{L} \sum_{t=1}^{L} P_t \hat{x}_t < \frac{1}{L} \sum_{t \in \{up\ steps\}} P_t (P_t \Delta + k^{-2/3}) < \frac{2 k^{-4/3}}{\Delta}$ , (31)

where in the last inequality we used $P_t < \frac{k^{-2/3}}{\Delta}$ (since the
farthest step is $k^{-2/3}$). The spacing loss, thus, satisfies:

spacing loss $= 2 \Delta k^{2/3} \left( -\frac{1}{L} \sum_{t=1}^{L} P_t \hat{x}_t \right) < 4 k^{-2/3}$ . (32)

By using Theorem 1 the upper bound is proven. The proof for
the lower bound is given in Appendix I, where we show that
there is a sequence that endlessly rotates the $k$-states EDM
machine in a minimal circle, incurring a regret of $\frac{1}{2} k^{-2/3} +
O(k^{-1})$.

■

Note that Theorem 4 implies that the $k$-state EDM machine
achieves a regret of $O(k^{-2/3})$ for any individual
continuous sequence bounded in [0,1]. Moreover, the regret
of the worst sequence, that is, the maximal regret, is at least
$\frac{1}{2} k^{-2/3} + O(k^{-1})$.

In Figure 8 the number of states vs. maximal regret achieved
by the EDM machine is plotted (regret of order $k^{-2/3}$). Also
plotted is the performance of the optimal DTM machine. Note
that it outperforms the EDM machine for a small number of
states. Nevertheless, while the achievable (maximal) regret of
the optimal DTM machine is lower bounded, the EDM can
attain any vanishing regret with a large enough number of states.



[Figure 8: maximal regret vs. number of states (up to 200); curves: Optimal DTM Machine, DTM's Lower Bound, EDM Machine.]

Fig. 8. Performance of EDM and optimal DTM machines.

V. Lower bound on the achievable maximal regret of any $k$-states machine

In Section III we have analyzed machines with a relatively
small number of states. We then examined the case of a large
number of states and proposed the EDM machine as a universal
predictor. We showed that asymptotically, using enough
states, it can achieve any vanishing regret. However, is it the
optimal solution? Does it attain a desired (maximal) regret
with the lowest number of states? In this section we present
an asymptotic lower bound on the number of states used by
any machine with maximal regret $R$.

Definition 6: Given a starting state $S_i$, a Threshold Sequence,
denoted $TS(x)$, is constructed for any $x$ in the
following manner - if the current state is smaller than $x$, the next
sample in the sequence is 1 (inducing an up-step); if not, the next
sample is 0 (inducing a down-step).

For any starting state and any $x$, the constructed $TS(x)$
induces monotone jumps to the vicinity of $x$ and then rotates
the machine in a minimal circle. If the starting state is below
$x$, the $TS(x)$ induces monotone up-steps until the machine
crosses $x$ (or monotone down-steps if the starting state is
above $x$). In the vicinity of $x$ the $TS(x)$ rotates the machine
only in a bounded number of states - the lowest possible state
is bounded from below by the maximum down-jump from
the nearest state to $x$ and the highest possible state is upper
bounded by the maximum up-jump from the nearest state to
$x$. Therefore, the $TS(x)$ endlessly rotates the machine in a
finite number of states, thus inducing a minimal circle. Since
the regret induced by the monotone part of the sequence is negligible,
this part can be ignored, and therefore we shall assume that
any $TS(x)$ endlessly rotates the machine in a minimal circle,
without the monotone part.

Lemma 3: Consider an FSM with maximal regret $R$. A
$TS(x)$ induces a minimal circle where at least half of its states
are within $\frac{R}{x}$ from $x$ for any $x < \frac{1}{2}$, and within $\frac{R}{1-x}$ from $x$
for any $x > \frac{1}{2}$.

Proof: Let us examine the regret of a $TS(x)$, where $x <
\frac{1}{2}$, that rotates an FSM, denoted $U$, in a minimal circle of
length $L$. Since the empirical mean of the sequence, $\bar{x}$, induces
the minimal square error, the regret satisfies

$R(U, x_1^L) \ge \frac{1}{L} \sum_{t=1}^{L} \left[ (x_t - \hat{x}_t)^2 - (x_t - x)^2 \right]$

$\ge \frac{2}{L} \sum_{t=1}^{L} (x - \hat{x}_t)(x_t - x)$ . (33)

We note that by construction, $(x - \hat{x}_t)(x_t - x)$ is positive for
all $t$. Moreover, since $x < \frac{1}{2}$ and $x_t = 1$ for up-steps and
$x_t = 0$ for down-steps, it follows that:

$R(U, x_1^L) \ge \frac{2}{L} \sum_{t=1}^{L} |x - \hat{x}_t| \, x$ . (34)

Hence half of the states have to be within $\frac{R}{x}$ from $x$, otherwise
we get a regret higher than $R$. In the same manner it can be
shown that for $x > \frac{1}{2}$ half of the states have to be within
$\frac{R}{1-x}$ from $x$.



Lemma 4: Consider an FSM with maximal regret $R$. The
maximum number of states crossed in an up-step and in a
down-step from state $S_i$, for any $i$, must satisfy

$m_{u,i} \ge \frac{1 - (S_i + \sqrt{R})}{2\sqrt{R}}$ (35)

$m_{d,i} \ge \frac{S_i - \sqrt{R}}{2\sqrt{R}}$ . (36)

Proof: See Appendix II.



Note that Lemma 4 implies the same lower bound on
the achievable regret of any DTM machine, $R \ge (\frac{1}{6})^2$ (as
presented in Section III). Any DTM machine allows only a
single state down-jump from all states below $\frac{1}{2}$. Thus, a DTM
machine may attain maximal regret $R$ only if all states below $\frac{1}{2}$
satisfy Equation (36) with $m_{d,i} = 1$, hence:

$\frac{S_i - \sqrt{R}}{2\sqrt{R}} \le 1$ . (37)

For states approaching $\frac{1}{2}$ this gives $\sqrt{R} \ge \frac{1}{6}$.
Furthermore, Lemma 4 provides a lower bound on the maximal
regret of any machine that allocates a state $S_i$ with
maximum up and down jumps of $m_{u,i}$ and $m_{d,i}$ states.
Theorem 5: The number of states in any deterministic FSM
with maximal regret $R$ is lower bounded by

$\frac{1}{24} R^{-3/2} + O(R^{-1})$ .

Proof: Consider a $k$-states machine with maximal regret
$R$. Lemma 3 implies that for any $x < \frac{1}{2}$ there is a $TS(x)$
that forms a minimal circle in the vicinity of $x$ where at least
half of the states are within $\frac{R}{x}$ from $x$. Since the samples of
the $TS(x)$ are either 0 or 1, the constructed minimal circle is
of at least $m_{u,i}$ states, where $m_{u,i}$ is the maximum up-jump
from the nearest state to $x$, denoted state $i$. Thus, there are at
least $\frac{1}{2} m_{u,i}$ states within $\frac{R}{x}$ from $x$. Lemma 4 implies that the
maximum up-step from state $i$ is at least $m_{u,i} = \lceil \frac{1 - S_i - \sqrt{R}}{2\sqrt{R}} \rceil$
states, where $S_i$ is the value assigned to state $i$.



We define the interval $B(m_u)$ as all $x$'s satisfying

$m_u = \lceil \frac{1 - x - \sqrt{R}}{2\sqrt{R}} \rceil$ . (38)

In other words, $B(m_u)$ is the interval

$[1 - \sqrt{R}(2m_u + 1),\ 1 - \sqrt{R}(2m_u - 1))$ .

Note that the length of this interval, $|B(m_u)|$, is always equal
to $2\sqrt{R}$. Now, let $N_1$ be the largest integer to satisfy $1 -
\sqrt{R}(2N_1 - 1) > \frac{1}{2}$, and $N_2$ be the smallest integer to satisfy
$1 - \sqrt{R}(2N_2 + 1) < 0$. We then can write

$\bigcup_{m_u = N_1}^{N_2} B(m_u) \supseteq [0, \frac{1}{2}]$ , (39)

where we note that $\{B(N_1), \ldots, B(N_2)\}$ are non-intersecting
intervals. Also note that the smallest value in $B(N_1)$ (that is,
$1 - \sqrt{R}(2N_1 + 1)$) is greater than $\frac{1}{2} - 2\sqrt{R}$. In the same
manner, the smallest value in $B(N_1 + i)$ (where $i$ is a positive
integer) is greater than $\frac{1}{2} - 2\sqrt{R}(i + 1)$.

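For $x \in B(m_u)$ there are at least $\frac{1}{2} m_u$ states within $\frac{R}{x}$ from
$x$. Therefore, in the interval $B(m_u)$ there are at least

$\min_{x \in B(m_u)} \frac{2\sqrt{R}}{R/x} \cdot \frac{m_u}{2}$

states. Using the fact that in an optimal machine the minimal
number of states in the lower and upper halves is equal (see
Section III-E), we can conclude that $k$, the number of states,
satisfies

$k \ge 2 \sum_{m_u = N_1 + 1}^{N_2 - 1} \min_{x \in B(m_u)} \frac{2\sqrt{R}}{R/x} \cdot \frac{m_u}{2}$

$\ge R^{-1} \sum_{m_u = N_1 + 1}^{N_2 - 1} \min_{x \in B(m_u)} x (1 - x - \sqrt{R})$ . (40)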



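The function $x(1 - x - \sqrt{R})$ is concave and has a single
maximum point at $\frac{1}{2}(1 - \sqrt{R})$. Thus, $\min_{x \in B(m_u)} x(1 - x -
\sqrt{R})$ is attained at the smallest value in the interval $B(m_u)$
(that is, $1 - \sqrt{R}(2m_u + 1)$). As was mentioned before, this
value is greater than $\frac{1}{2} - 2\sqrt{R}(m_u - N_1 + 1)$, and therefore
substituting the latter further reduces the function $x(1 - x - \sqrt{R})$. Thus, we
can write

$k \ge \frac{1}{2} R^{-3/2} \sum_{i=1}^{\lfloor 1/(4\sqrt{R}) \rfloor} 2\sqrt{R} \left(\frac{1}{2} - 2\sqrt{R}\, i\right)\left(\frac{1}{2} + 2\sqrt{R}\, i - \sqrt{R}\right)$ . (41)

The sum is a Riemann approximation of $\int_0^{1/2} (\frac{1}{2} - u)(\frac{1}{2} + u)\, du = \frac{1}{12}$,
which gives $k \ge \frac{1}{24} R^{-3/2} + O(R^{-1})$.
This concludes the proof. ■

Note that Theorem 5 implies that a $k$-states FSM cannot
attain maximal regret smaller than

$(24k)^{-2/3} + O(k^{-1})$ . (42)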
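The asymptotic constant in Theorem 5 can be checked numerically. The sketch below evaluates a sum of the form used in (41) and compares it with the leading term $\frac{1}{24} R^{-3/2}$. The function names are hypothetical, and the prefactor $\frac{1}{2} R^{-3/2}$ and the summation range starting at $i = 1$ are assumptions about the exact form of (41).

```python
import math

def theorem5_sum_bound(R):
    """Evaluate (1/2) R^(-3/2) * sum_i 2*sqrt(R)*(1/2 - 2*sqrt(R)*i)*(1/2 + 2*sqrt(R)*i - sqrt(R))."""
    root = math.sqrt(R)
    n_terms = math.floor(1.0 / (4.0 * root))   # assumed upper summation limit
    total = 0.0
    for i in range(1, n_terms + 1):            # assumed to start at i = 1
        total += 2.0 * root * (0.5 - 2.0 * root * i) * (0.5 + 2.0 * root * i - root)
    return 0.5 * R ** -1.5 * total

def leading_term(R):
    """Leading-order lower bound on the number of states, R^(-3/2) / 24."""
    return R ** -1.5 / 24.0
```

As $R$ decreases, the ratio `theorem5_sum_bound(R) / leading_term(R)` approaches 1 from below, consistent with a correction of lower order in $R$.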



VI. Enhanced Exponential Decaying Memory machine

In Section IV we showed that the EDM machine can achieve
any maximal regret, as small as desired. In this section we
present a new FSM named the Enhanced Exponential Decaying
Memory (E-EDM) machine. We prove that it outperforms
the EDM machine and better approaches the lower bound
presented in the previous section.

A. Designing the E-EDM machine 

The algorithm for constructing the E-EDM machine for a
desired regret $R_d$ is as follows.

• Set $R = \frac{R_d}{2}$.

• Divide the interval [0, 1] into segments, denoted
$A(m_u, m_d)$, where each contains all $x$'s satisfying both

$m_u = \lceil \frac{1 - x - \sqrt{R}}{2\sqrt{R}} \rceil$ , $m_d = \lceil \frac{x - \sqrt{R}}{2\sqrt{R}} \rceil$ . (43)

Note that these segments are non-intersecting.

• Linearly spread states in each segment $A(m_u, m_d)$ with
a $\Delta(m_u, m_d)$ spacing gap between them, where

$\Delta(m_u, m_d) = \frac{\sqrt{R}}{2 m_u m_d}$ . (44)

• Assign all states in segment $A(m_u, m_d)$ with maximum
up and down jumps of $m_u$, $m_d$ states, correspondingly.
Note that according to Lemma 4 these are the minimal
maximum jumps allowed in order to achieve maximal
regret smaller than $R$.

• Assign transition thresholds for each state $i$ as follows:

$T_{i,j} = S_i + (2j + 1)\sqrt{R} \quad \forall\ -m_{d,i} \le j \le m_{u,i}$ , (45)

that is, if the machine at time $t$ is at state $i$, it jumps $j$
states if the current outcome, $x_t$, satisfies:

$S_i + (2j - 1)\sqrt{R} \le x_t < S_i + (2j + 1)\sqrt{R}$ . (46)

Note that as required, the transition thresholds cover the
[0, 1] axis (this arises from the chosen maximum up and down
jumps).

• We further need to guarantee the desired regret when
the machine traverses between segments. Consider two
adjacent segments $A(m_{u,1}, m_{d,1})$ and $A(m_{u,2}, m_{d,2})$ and
suppose the spacing gap in the second segment is smaller.
Add states to the first segment such that the closest
$m_{u,1} + m_{d,1}$ states to the second segment have a spacing
gap of $\Delta(m_{u,2}, m_{d,2})$. It can be shown that at most two
states need to be added to each segment. Figure 9 depicts
the spacing gap in two adjacent segments.
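The per-state quantities used by the algorithm above can be sketched in Python as follows. This is an illustrative reading of (43), (44) and (46) rather than the authors' implementation: the function names are hypothetical, the clamping of the jump to the range $[-m_d, m_u]$ is an assumption about behaviour at the edges, and the extra states added between adjacent segments are not modelled.

```python
import math

def segment_jumps(x, R):
    """Maximum up and down jumps (m_u, m_d) of the segment containing x, as in (43)."""
    root = math.sqrt(R)
    m_u = math.ceil((1.0 - x - root) / (2.0 * root))
    m_d = math.ceil((x - root) / (2.0 * root))
    return m_u, m_d

def spacing_gap(m_u, m_d, R):
    """Spacing gap between states of segment A(m_u, m_d), as in (44)."""
    return math.sqrt(R) / (2.0 * m_u * m_d)

def jump_for_sample(s_i, x_t, R, m_u, m_d):
    """Number of states to jump from state value s_i on outcome x_t, as in (46)."""
    j = math.floor((x_t - s_i) / (2.0 * math.sqrt(R)) + 0.5)
    return max(-m_d, min(m_u, j))    # assumed clamping to the allowed jumps
```

For instance, with $R = 0.01$ a state near $x = 0.45$ lies in the segment $A(3, 2)$, whose spacing gap is $0.1 / 12$.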

Fig. 9. Spacing gap of the E-EDM machine. Adjacent segments
$A(m_{u,1}, m_{d,1})$ and $A(m_{u,2}, m_{d,2})$ with spacing gap $\Delta_s = \frac{\sqrt{R}}{2 m_{u,s} m_{d,s}}$,
where $s = 1, 2$ and $\Delta_2 < \Delta_1$. Note that the spacing gap between the highest
$m_{u,1} + m_{d,1}$ states in segment $A(m_{u,1}, m_{d,1})$ is $\Delta_2$ while the maximum
up and down jumps from these states are $m_{u,1}$ and $m_{d,1}$ states.



Recall that the transition thresholds in the EDM machine
are $T_{i,j} = S_i + (j + \frac{1}{2}) \Delta k^{2/3}$. Since $\Delta \approx k^{-1}$, if we take the
desired regret to be $R_d = \frac{1}{2} k^{-2/3}$, that is, $R = \frac{1}{4} k^{-2/3}$, we
get that the transition thresholds in the E-EDM machine are
identical to those defined for the EDM machine. Furthermore,
recall that according to Theorem 4 the maximal regret of
the $k$-states EDM machine is greater than $\frac{1}{2} k^{-2/3}$. Thus, the
new machine presented here achieves lower maximal regret
by better allocating the states - the states of the EDM are
uniformly distributed over the interval [0, 1] while in the E-EDM
machine the interval [0, 1] is divided into segments and
states are uniformly distributed with a different spacing in each
segment. This will be proved more rigorously in the sequel.

We shall now prove that the maximal regret in an E-EDM
machine, constructed by the algorithm above, indeed is no
more than the desired regret $R_d$.

Theorem 6: The construction of the E-EDM machine according
to the algorithm in Section VI-A yields a machine with maximal
regret that is no more than $R_d$.

Proof: Consider a sequence $x_1^L$ that endlessly rotates the
E-EDM machine (denoted $U_{E-EDM}$) in a minimal circle of
$L$ states $\{\hat{x}_t\}_{t=1}^{L}$. Each input sample $x_t$ can be written as follows:

$x_t = \hat{x}_t + 2\sqrt{R} \cdot P_t + \delta_t$ , (47)

where $P_t$ is the number of states the machine crosses at time
$t$ ($-m_d \le P_t \le m_u$) and $\delta_t$ satisfies $|\delta_t| \le \sqrt{R}$ and can be
regarded as a quantization addition that has no impact on the
jump at time $t$, i.e., has no impact on the next prediction. Since
we examine a minimal circle, the sum of states crossed on the
way up is equal to the sum of states crossed on the way down,
i.e., $\sum_{t=1}^{L} P_t = 0$. By applying this and Jensen's inequality, the
regret of the sequence satisfies:



$R(U_{E-EDM}, x_1^L) \le \frac{1}{L} \sum_{t=1}^{L} \delta_t^2 - \frac{4\sqrt{R}}{L} \sum_{t=1}^{L} P_t (\hat{x}_t - \hat{x}_1)$ . (48)

We term the first loss in the right hand side of Equation (48)
quantization loss (since it depends only on $\delta_t$, the quantization
of the input sample, $x_t$). By applying $|\delta_t| \le \sqrt{R}$ we get:

quantization loss $= \frac{1}{L} \sum_{t=1}^{L} \delta_t^2 \le R$ . (49)

We term the second loss in the right hand side of Equation
(48) spacing loss (since $\hat{x}_t - \hat{x}_1$ depends only on the spacing
gap between states). Thus, as we showed for the EDM machine,
the regret of the sequence is upper bounded by a loss incurred
by the quantization of the input samples and a loss incurred
by the quantization of the states' values, i.e., the prediction
values.

Lemma 5: For any sequence $x_1^L$ that endlessly rotates the
E-EDM machine in a minimal circle of states $\hat{x}_1^L$, where the
spacing gap between all states is identical, the spacing loss is
smaller than $R$, satisfying:

spacing loss $= -\frac{4\sqrt{R}}{L} \sum_{t=1}^{L} P_t (\hat{x}_t - \hat{x}_1) < R$ . (50)

Proof: See Appendix III.



Lemma 6: For any sequence $x_1^L$ that rotates the E-EDM
machine in a minimal circle of states $\hat{x}_1^L$, where the spacing
gap is not equal between all states, the spacing loss is smaller
than $R$, satisfying:

spacing loss $= -\frac{4\sqrt{R}}{L} \sum_{t=1}^{L} P_t (\hat{x}_t - \hat{x}_1) < R$ .

Proof: See Appendix IV. ■

Since $R = \frac{R_d}{2}$ and by applying Theorem 1, the proof is
concluded. ■



B. Performance Evaluation 

The following theorem gives the number of states used by
an E-EDM machine designed with a desired regret $R_d$.

Theorem 7: The number of states in an E-EDM machine 
designed to achieve maximal regret smaller than Rd is 

Proof: See Appendix [V] ■ 
Theorem 4 implies that the asymptotic worst regret of the
$k$-states EDM machine is at least $\frac{1}{2} k^{-2/3}$. Thus, the number
of states in an EDM machine with maximal regret $R_d$ is at
least $(2R_d)^{-3/2}$ states. Theorem 5 implies that the asymptotic
number of states of any deterministic FSM with maximal
regret $R_d$ is at least $\frac{1}{24} R_d^{-3/2}$. Theorem 7 implies that the
asymptotic number of states in an E-EDM machine with
maximal regret $R_d$ is $\frac{1}{12} (\frac{R_d}{2})^{-3/2}$. Thus we can conclude that:
1) For a given desired regret, the E-EDM machine outperforms
the EDM machine in number of states by a factor
of:

$\frac{\frac{2^{3/2}}{12} R_d^{-3/2}}{(2R_d)^{-3/2}} = \frac{2}{3}$ ,

i.e., uses only $\frac{2}{3}$ of the states needed for the EDM
machine to achieve the same maximal regret.
Fig. 10. Comparing the performance of the E-EDM machine, the EDM
machine and the lower bound.



2) For a given desired regret, the E-EDM machine approaches
the lower bound with a factor of about:

$\frac{\frac{2^{3/2}}{12} R_d^{-3/2}}{\frac{1}{24} R_d^{-3/2}} = 2^{5/2} \approx 5.66$ .

In Figure 10 we plot the (maximal) regret attained by the
EDM and E-EDM machines as a function of the number of
states, together with the lower bound given in Theorem 5.
Note that for a large number of states the E-EDM machine
indeed outperforms the EDM machine by a factor of $\approx \frac{2}{3}$ and
approaches the lower bound with a factor of $\approx 6$.
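The two factors quoted above follow from simple arithmetic on the leading-order state counts. A small sketch, assuming the asymptotic expressions $(2R_d)^{-3/2}$ for the EDM machine, $\frac{1}{12}(R_d/2)^{-3/2}$ for the E-EDM machine and $\frac{1}{24} R_d^{-3/2}$ for the lower bound, with lower-order terms dropped and hypothetical function names:

```python
def edm_states(R_d):
    """Leading-order number of states of an EDM machine with maximal regret R_d."""
    return (2.0 * R_d) ** -1.5

def e_edm_states(R_d):
    """Leading-order number of states of an E-EDM machine with desired regret R_d."""
    return (R_d / 2.0) ** -1.5 / 12.0

def lower_bound_states(R_d):
    """Leading-order lower bound on the states of any FSM with maximal regret R_d."""
    return R_d ** -1.5 / 24.0
```

Both ratios, `e_edm_states(R_d) / edm_states(R_d)` and `e_edm_states(R_d) / lower_bound_states(R_d)`, are independent of `R_d` under these leading-order expressions.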



VII. Summary and conclusions 

In this paper we studied the problem of predicting an
individual continuous sequence as well as the empirical mean
with a finite-state machine.

For small number of states, or equivalently, when the desired 
maximal regret is relatively large, we presented a new class of 
machines, termed the Degenerated Tracking Memory (DTM) 
machines. An algorithm for constructing the best predictor 
among this class was given. For small enough number of 
states, this optimal DTM machine was shown to be optimal 
among all machines. It is still unknown up to which number 
of states this result holds true. Nevertheless, for larger number 
of states, one can try to attain better performance by easing 
the constraints imposed on the class of DTM machines and 
allowing more than a single state down-jump (up-jump) from 
all states in the lower (upper) half. The construction of the 
optimal machine in that case is, however, much more complex. 
Another important implication of these restrictions is a lower
bound of $R = 0.0278$ on the achievable maximal regret of
any DTM machine.

For universal predictors with a large number of states, or
equivalently, when the desired maximal regret is relatively
small, we proved a lower bound of $O(k^{-2/3})$ on the maximal
regret of any $k$-states machine. We proposed the Exponential
Decaying Memory (EDM) machine and showed that the worst
sequence incurs a bounded regret of $O(k^{-2/3})$, where $k$ is the
number of states. We further presented the Enhanced Exponential
Decaying Memory (E-EDM) machine which outperforms
the EDM machine and better approaches the lower bound.
An interesting observation is that both machines are equivalent
up to the prediction values, where a better state allocation is
performed when constructing the E-EDM machine. Recalling
that the EDM machine is a finite-memory approximation
of the Cumulative Moving Average predictor, which is the
best unlimited resources universal predictor (w.r.t. the non-universal
empirical mean predictor) [2], we can understand
why both the EDM and the E-EDM machines approach
optimal performance.

Analyzing the performance of the EDM and the E-EDM 
machines showed that the regret of any sequence can be upper 
bounded by the sum of two losses - quantization loss, the loss 
incurred by the quantization of the input samples, and spacing 
loss, the loss incurred by the quantization of the prediction 
values. It is worth mentioning that the maximal regret of the 
optimal DTM machine can also be upper bounded by the sum 
of these losses. As the number of states in the optimal DTM
machine increases, the quantization loss goes to the lower
bound, $R = 0.0278$, and the spacing loss goes to zero. Thus,
understanding the optimal allocation between these two losses 
may lead to the answer of up to which number of states the 
optimal DTM machine is the best universal predictor. It is also 
worth mentioning that the E-EDM machine is constructed with 
allocating half of the desired regret to the quantization loss and 
the other half to the spacing loss. A further optimization may 
be obtained by a different allocation. 

Throughout this paper we assumed that the sequence's 
outcomes are bounded. Note that this constraint is mandatory 
since the performance of a universal predictor is analyzed 
by the regret of the worst sequence. In the unbounded case, 
for any finite-memory predictor one can find a sequence that 
incurs an infinite regret. However, an optional further study is 
to expand the results presented here to a more relaxed case, 
e.g. sequences with a bounded difference between consecutive 
outcomes. 

In this study we essentially examined finite-memory uni- 
versal predictors trying to attain the performance of the (non- 
universal) "zero-order" predictor, i.e., the empirical variance 
of any individual continuous sequence. We believe that our 
work is the first step in the search for the best finite-memory 
universal predictor trying to attain the performance of the best 
(non-universal) L-order predictor, for any L. 



Appendix I 

Proof of the lower bound given in Theorem 4

Proof: Here we show that there is a continuous-valued
sequence which rotates the EDM machine (denoted $U_{EDM}$)
in a minimal circle incurring a regret of $\frac{1}{2} k^{-2/3} + O(k^{-1})$.

Consider the following minimal circle - $m$ states up-step,
$m - 1$ states down-step, $m$ states up-step, $m - 1$ states down-step
and so on $m - 1$ times. The last step is a down-step of
$m - 1$ states that closes the circle and returns the machine to
the initial state. Denoting the states' gap by $\Delta$, the described






sequence can be written as follow^ 

xi = ii + (to + i)Afc^/^ 

1 - i)Afc2/3 



X2 = Xi+mA—{m ^ 
X3 ^ xi + A + {m + i)Afc2/3 



X2m-3 = xi + (to - 2)A + (m + i)AA:2/3 

X2rn-2 = Xi + (2to - 2)A - (to - 1 - ^)A0^ 

X2m-i + (to - 1)A - (m - 1 - i)Afc2/^ . 

Now, assuming that all of these sample are between and 1, 
one can note that they form a minimal circle of 2to — 2 states 
{xi, . . . ,X2m-i} with equal A spacing between them. The 
circle is as follows: xi ^ ^m+i X2 ^ ^3 ^ 

. . . ^ X2m-i n- xi, whcrc ^ and i->- denote up and 

down-step, accordingly. 

Analyzing the regret of the described sequence results in 



RiUEDM,x1"^ 



-1) = A2(ifc4/3 ^ ^(^ _ 1)^2/3 _ m(™-l) 



Let us choose 



1 1,-2/3 
I 2'^ 

L A 



J , 



(51) 



(52) 



where [x\ denotes the rounding of x to the largest previous 
integer. In that case the highest sample, X2m-3, satisfies 
a;2m-3 < xi + l/2fc-2/3 _ 2A + 1/2 + l/2fc-i/^ and the 
lowest sample X2m.-i, satisfies X2m~i > xi + l/2fc^^/'^ — 
2A - 1/2 + 3/2fc-i/3. Choosing, for example, 

xi = Q(i-i/fc-i/3-ifc-2/3 + A) , 

where Q{-) denotes the quantization to the nearest state, 
results a;2m-3 < 1 and X2m-i > 0, and thus all samples 
{xi, . . . ,X2m~i} are valid, that is, satisfy < Xt < I- 

Now, by applying Equation (|52]) into Equation ( |5T] i we get 

R{UEDM,x'r-') = iA2fc4/3 + lfc-2/3 ^ o(fc-l) 

= iA:-'/3^0(fc-i) . (53) 



Appendix II 
Proof of Lemma 4

Proof: Consider a sequence xi, ...,xl^i that rotates an 
FSM, denoted U, in a minimal circle, where xi induces a 
single up-jump of L states and induce down-jumps of a 
single state. Since the regret of any zero-step minimal circle is 
smaller than R, an input sample that satisfies x = xt — VR—e, 
where e 0+, must induce a down-jump of at least one 
state. Thus, we can always choose the input samples Xj^^ to 
satisfies Xt > Xt~ VR. We shall also assume that xi satisfies: 



xi> xi + (1 + 2L)Vi? 



(54) 



'Note that we can always apply ^ > as small as desired to ensure that 
the samples are not exactly equal to the transition threshold, but otherwise 
inside the regions of transition. For example, we could have taken xi = 
xi + (m + I - 5)Afc2/3 with ^ -> 0. 



where ii — Si. We show that this assumption can not hold 
true. 

By denoting X^. — Xf — xi we note that the empirical mean 
of the sequence satisfies: 



X > xi 



R 



(55) 



Now, let us examine the regret incurred by the described 
sequence: 

L+l 

R{U, ) = ZTT - ^*)' - - 

t=i 

L+l 

= {x- xif + lTT X! -^t ~ 2At(a;t - ii) 



i=l 

L+l 



>{x-xif 



L + l 



L+l 



t=l 



t=l 

L+l 

>-R+;^^(2\/i?-At)A 



(56) 
(57) 
(58) 



where ( pS] ) follows Af > and Xf < Xt for all the down 
samples X2^^, (|57| follows ( |55] l. In |10| it is shown that 
in an FSM with maximal regret R w.r.t. binary sequences, 
the maximal up-jump is no more than 2y/R. Therefore, this 
must hold also for continuous-valued sequences. Hence, in 
the discussed minimal circle all states are within 2^/R from 
the initial state, that is 2y/R > X-t for all t and we get 
RiU,x^) > R. 

We can now conclude that to attain a regret smaller than
$R$, any input sample $x$ that induces an $L$ states up-jump from
state $i$ must satisfy:

$x < S_i + (1 + 2L)\sqrt{R}$ . (59)

Since an input sample 1 induces an $m_{u,i}$ states jump from
state $i$ we conclude that the following must be satisfied:

$1 < S_i + (1 + 2m_{u,i})\sqrt{R}$ . (60)

In the same manner it can be shown that $0 > S_i - (1 +
2m_{d,i})\sqrt{R}$.



Appendix III 
Proof of Lemma 5



Proof: First we note that:

$\frac{1}{L} \sum_{t=1}^{L} P_t (\hat{x}_t - \hat{x}_1) = \frac{1}{L} \sum_{t=1}^{L} P_t \hat{x}_t$ , (61)

where we used $\sum_{t=1}^{L} P_t = 0$. Note that $P_t \hat{x}_t$ is positive for
up-steps and negative for down-steps. We consider a minimal
circle within a segment $A(m_u, m_d)$ that crosses states with
the same spacing gap, denoted $\Delta = \Delta(m_u, m_d)$. It follows
that:

$\frac{1}{L} \sum_{t=1}^{L} P_t (\hat{x}_t - \hat{x}_1) = \frac{\Delta}{L} \sum_{t=1}^{L} P_t \sum_{j=1}^{t-1} P_j$ .

Define mixed sequences as sequences where the up and
down steps are interlaced. Define straight sequences as sequences
where all the up-steps are first, followed by all the
down-steps (consecutive in time). We show that any mixed
sequence with $\{P_t\}_{t=1}^{L}$ jumps that rotates the machine in a
minimal circle with the same spacing gap for all states can
be transformed into a straight sequence with the same jumps
only in a different order (up-jumps are first) without changing
the spacing loss of the sequence. First we note that for any
three interlaced jumps

up jump → down jump → up jump,

that cross $P_{u,1}$, $P_d$, $P_{u,2}$ states (accordingly), starting from state
$X_{u,1}$, the following holds true:

$P_{u,1} X_{u,1} + P_d (X_{u,1} + P_{u,1}\Delta) + P_{u,2} (X_{u,1} + (P_{u,1} + P_d)\Delta)$

$= P_{u,1} X_{u,1} + P_{u,2} (X_{u,1} + P_{u,1}\Delta) + P_d (X_{u,1} + (P_{u,1} + P_{u,2})\Delta)$ . (62)

Thus, Equation (62) implies that the spacing loss of these three
jumps does not change when the order of the jumps is:

up jump → up jump → down jump.

This can be shown also for a sequence with more than one
consecutive down-jump between two up-steps:

up jump → down jump → ⋯ → down jump → up jump.

Hence, in a recursive way any mixed sequence can be transformed
into a straight sequence without changing the spacing
loss by moving all the down-jumps to the end of the sequence.
In the rest of the proof we shall assume straight sequences.
Note that this transformation changes the states of the minimal
circle, but since we transform the sequence only for an easier
analysis, we can assume that all states still have the same
spacing gap. Figure 11 gives an example.
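The reordering argument can be checked numerically. Under the uniform-spacing assumption the state at time $t$ sits at an offset of $\sum_{j<t} P_j$ states from the first state, so the sum $\sum_t P_t \sum_{j<t} P_j$ determines the spacing loss up to a constant factor. The sketch below (the function name `spacing_sum` is hypothetical) computes this sum so that different orderings of the same jumps can be compared.

```python
from itertools import permutations

def spacing_sum(jumps):
    """Sum over t of P_t times the offset (in states) of the state visited at time t."""
    total, offset = 0, 0
    for p in jumps:
        total += p * offset   # contribution of the jump taken from the current offset
        offset += p           # the machine moves p states (negative p is a down-jump)
    return total

# Jumps of a closed circle (they sum to zero), in mixed order.
mixed = [2, -1, 3, -4]
values = {spacing_sum(list(order)) for order in permutations(mixed)}
```

Here `values` contains a single number, which illustrates that moving the down-jumps to the end of the sequence leaves the spacing loss unchanged when all states share the same spacing gap.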





Fig. 11. An example for a mixed sequence transformed into a straight 
sequence. 

We continue by proving that applying maximum up and 
down steps maximize the spacing loss. Consider two consec- 
utive down-steps of Pd^ , Pd^ states staring at state x, with a 



total of C states, i.e \Pd^ \ + \Pd2 1 = C- Note that we examine 
two down-steps, thus C < 2md- The spacing loss of these two 
down-steps is: 

x-\Pd,i\ + ix-\Pd,i\A)-\Pd,2\^ x-C-\Pd.i\iC-\Pd,i\)A . 

(63) 

If $C \le m_d$ the spacing loss is maximized for $|P_{d,1}| = C$
and $|P_{d,2}| = 0$. If $m_d < C \le 2m_d$ then the spacing loss is
maximized for $|P_{d,1}| = m_d$. Hence we can maximize the
spacing loss by uniting a pair of down-steps into a single
down-step (if together they cross no more than $m_d$ states), or
by applying the maximum down-step, $m_d$, to the first and
$C - m_d$ to the second (if together they cross more than $m_d$
states). Thus, assuming straight sequences, we can start with
the first pair of down-steps, maximize the spacing loss by
applying the maximum down-step, then take the third down-step
and apply the maximum down-step together with the new
down-steps that were created. Proceeding recursively, we
maximize the spacing loss by applying maximum down-steps (note
that the number of down-steps decreases, which also maximizes
the spacing loss). In the same manner it can be shown that
applying maximum up-steps maximizes the spacing loss.
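The two-step argument can be illustrated by brute force. The sketch below is an assumption-laden check rather than the paper's method: it evaluates the two-down-step expression of Equation (63) for every feasible split of $C$ states into two down-steps of at most $m_d$ states each, and returns the first-step sizes that attain the maximum. The function names are hypothetical, and the call requires $C \le 2m_d$.

```python
def two_down_step_loss(x, p1, p2, delta):
    """Loss of a down-step of p1 states from value x followed by one of p2 states."""
    return x * p1 + (x - p1 * delta) * p2


def best_first_steps(x, C, m_d, delta):
    """First-step sizes p1 that maximize the loss, with p1 + p2 = C
    and both steps limited to m_d states (requires C <= 2 * m_d)."""
    feasible = [p1 for p1 in range(C + 1) if p1 <= m_d and C - p1 <= m_d]
    losses = {p1: two_down_step_loss(x, p1, C - p1, delta) for p1 in feasible}
    best = max(losses.values())
    return [p1 for p1 in feasible if losses[p1] == best]
```

For $C \le m_d$ the maximizers are `0` and `C`, which both describe the same single united down-step. For $m_d < C \le 2m_d$ the maximizers are `C - m_d` and `m_d`, i.e., one of the two steps is a maximum down-step.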








Fig. 12. An example for the worst case spacing loss of a minimal circle that 
crosses 5 states in the segment A(3, 2). 

Consider a minimal circle of $C$ states crossed on the way
up and down, all in the segment $A(m_u, m_d)$. The worst case
scenario for the spacing loss is composed of $N_u$ up-steps, each
a jump of $m_u$ states (maximum up-jump), a single up-step of
$c_u$ states, where $c_u = \operatorname{mod}(C, m_u)$, $N_d$ down-steps, each
a jump of $m_d$ states (maximum down-jump), and a single
down-step of $c_d$ states, where $c_d = \operatorname{mod}(C, m_d)$. $N_d$ and $N_u$ satisfy
$C = N_u m_u + c_u$ and $C = N_d m_d + c_d$. It can be shown that
the position in the sequence of the single up-step (of $c_u$ states)
and the single down-step (of $c_d$ states) has no impact on the
spacing loss. Let us analyze the spacing loss of the straight
sequence. First, all up-steps satisfy:

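$$-\frac{1}{L}\sum_{t \in \{\text{up steps}\}} P_t (x_t - x_1) = -\frac{\Delta}{L}\Big(\sum_{i=0}^{N_u-1} m_u (i \cdot m_u) + N_u m_u c_u\Big) = -\frac{\Delta}{2L}\big(C^2 - m_u C + c_u(m_u - c_u)\big). \tag{64}$$

In the same manner, all down-steps satisfy:

$$-\frac{1}{L}\sum_{t \in \{\text{down steps}\}} P_t (x_t - x_1) = \frac{\Delta}{L}\Big(\sum_{i=1}^{N_d} m_d (i \cdot m_d) + c_d C\Big) = \frac{\Delta}{2L}\big(C^2 + m_d C - c_d(m_d - c_d)\big). \tag{65}$$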
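The closed forms in Equations (64) and (65) can be sanity-checked by walking a straight worst-case circle explicitly. The sketch below is only illustrative: it takes a unit spacing gap, drops the common normalization factor, measures positions in states above the lowest state, and uses unsigned step sizes. The function name `worst_case_sums` is hypothetical.

```python
def worst_case_sums(C, m_u, m_d):
    """Walk C states up and back down using maximum steps plus one remainder step.

    Returns (up_sum, down_sum, c_u, c_d), where each sum adds
    step_size * position_before_the_step over the corresponding steps.
    """
    n_u, c_u = divmod(C, m_u)
    n_d, c_d = divmod(C, m_d)
    up_steps = [m_u] * n_u + ([c_u] if c_u else [])
    down_steps = ([c_d] if c_d else []) + [m_d] * n_d

    idx, up_sum = 0, 0
    for p in up_steps:
        up_sum += p * idx
        idx += p          # idx reaches C after the last up-step

    down_sum = 0
    for p in down_steps:
        down_sum += p * idx
        idx -= p          # idx returns to 0 after the last down-step
    return up_sum, down_sum, c_u, c_d


if __name__ == "__main__":
    for C in range(1, 30):
        for m_u in range(1, 7):
            for m_d in range(1, 7):
                up, down, c_u, c_d = worst_case_sums(C, m_u, m_d)
                assert 2 * up == C * C - m_u * C + c_u * (m_u - c_u)
                assert 2 * down == C * C + m_d * C - c_d * (m_d - c_d)
    print("closed forms agree with the explicit walk")
```

The difference `down_sum - up_sum` is the quantity that survives when the up-step and down-step contributions are combined; twice that difference equals `C*(m_u + m_d) - c_u*(m_u - c_u) - c_d*(m_d - c_d)`.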
Thus, the worst case scenario of the spacing loss satisfies:

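$$-\frac{1}{L}\sum_{t=1}^{L} P_t (x_t - x_1) = \frac{\Delta}{2L}\big(C(m_u + m_d) - c_u(m_u - c_u) - c_d(m_d - c_d)\big) \tag{66}$$
$$\le \frac{\Delta}{2L}\, C(m_u + m_d), \tag{67}$$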
where the length of the circle satisfies: 

Therefore, the worst case scenario satisfies: 

L 



(68) 



(69) 



t=i 



Since A — A(m„, rrid) — S^^ ^hat the spacing loss 

for any minimal circle within a segment (and with identical 
spacing gap between all states) satisfies: 

spacing loss < 4\/^=^ A(m„, ma) = R . (70) 



Appendix IV
Proof of Lemma 6

Proof: We denote two adjacent segments by
$A(m_{u,1}, m_{d,1})$ and $A(m_{u,2}, m_{d,2})$. Assume $A(m_{u,1}, m_{d,1})$
is the lower segment and the minimal circle starts at the
lowest state. Denote the spacing gap of each segment by
$\Delta_1 = \Delta(m_{u,1}, m_{d,1})$ and $\Delta_2 = \Delta(m_{u,2}, m_{d,2})$. Note that
if $\Delta_1 < \Delta_2$ then $m_{u,2} = m_{u,1} - 1$, $m_{d,2} = m_{d,1}$ and if
$\Delta_1 > \Delta_2$ then $m_{u,2} = m_{u,1}$, $m_{d,2} - 1 = m_{d,1}$.



Fig. 13. Spacing gap between states in the connection between the segments
$A(m_{u,1}, m_{d,1})$ and $A(m_{u,2}, m_{d,2})$. See the E-EDM machine definitions in
Section VI.

First we assume that the minimal circle traverses between the
segments only once (that is, once on the way up and once on
the way down). We also assume that $\Delta_1 < \Delta_2$. We can now
divide the minimal circle into two virtual minimal circles: take
the up-step that moves the machine to the higher segment
and denote the destination state of this jump by $x_c$. Take a
down-step that crosses state $x_c$ and split it into two steps:
assuming the down-step crosses $P_d$ states, $c_d$ states jump to
state $x_c$ and $(P_d - c_d)$ states jump from state $x_c$. Note that
two minimal circles were constructed: a left minimal circle
that traverses $C_1$ states and a right minimal circle that
traverses $C_2$ states. This is depicted in Figure 14. The spacing
loss of the down-step satisfies:

$$P_d\,(x_c + c_d \Delta_1) = c_d\,(x_c + c_d \Delta_1) + (P_d - c_d)\, x_c + (P_d - c_d)\, c_d \Delta_1. \tag{71}$$



Fig. 14. Minimal circle that traverses once between segments. The
marked down-step that crosses state $x_c$ is split into two down-steps, creating two
virtual minimal circles to the right and left. Note that since the first $m_{u,2} + m_{d,2}$
states of the second segment have spacing gap $\Delta_1$, the marked
down-step must only cross states with spacing gap $\Delta_1$.

Note that $x_c$ is in the upper segment, but we used $\Delta_1$ since
the first $m_{u,2} + m_{d,2}$ states in the upper segment have a spacing
gap of $\Delta_1$ (see the construction of the E-EDM machine in
Section VI-A). Also note that the first term on the right-hand
side of Equation (71) belongs to the spacing loss of the right
minimal circle and the middle term belongs to the spacing loss
of the left minimal circle. The spacing loss of the minimal
circle is thus composed of the spacing losses of the left and
right minimal circles and the last term in Equation (71). The
left minimal circle traverses $C_1$ states, all with spacing gap
$\Delta_1$. The right minimal circle traverses $C_2$ states, some with
spacing gap $\Delta_1$ and some with $\Delta_2$. We can now conclude
that the spacing loss satisfies:

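$$\text{spacing loss} \le \frac{4\sqrt{R}}{L}\Big( \big[C_1(m_{u,1} + m_{d,1}) - (P_d - c_d)(m_{d,1} - (P_d - c_d))\big]\frac{\Delta_1}{2} + \big[C_2(m_{u,2} + m_{d,2}) - c_d(m_{d,2} - c_d)\big]\frac{\Delta_2}{2} + c_d\,(P_d - c_d)\,\Delta_1 \Big), \tag{72}$$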

where we applied Lemma 5 (Equation (66)) to bound the
spacing loss of the left and right minimal circles. Note that
Lemma 5 holds for the right minimal circle since all its states
have a spacing gap of no more than $\Delta_2$. Now, since
$m_{d,1} = m_{d,2}$ and $\Delta_1 < \Delta_2$ we get:

spacing loss < 4\/i?j(Ci(r7iti_i + 

+ C'2(»7in,2 



mds)^-i 
+ md,2)^ 

"Id, 2 



C2 



Let us bound the length of the minimal circle 



) 



C2 



rrid.i 



C1+C2- 
nid.l 



Applying this to Equation (73) yields:
$$\text{spacing loss} \le R.$$



(73) 



(74) 



(75) 



Assume again that the minimal circle traverses between the
segments only once, but now assume $\Delta_1 > \Delta_2$. Divide the
minimal circle into two virtual minimal circles in the same
manner as above, but now take the down-step that moves the
machine to the lower segment and split an up-step. In the same
manner we can show that the spacing loss is no more than
$R$.

If the minimal circle traverses between segments $m$ times,
then in the same manner as above we can divide
the circle into $m$ left minimal circles and $m$ right minimal
circles and bound the spacing loss.



Appendix V
Proof of Theorem 7



Proof: Consider an E-EDM machine that was designed to
attain maximal regret $R_d$. By denoting $R = \frac{R_d}{2}$, the number
of states satisfies:

k < 



E 



|A(m^,mri)| 



+ 2) 



(76) 



where all states in the segment $A(m_u, m_d)$ have a maximum
up-step and down-step of $m_u$ and $m_d$ states and a spacing
gap of $\Delta(m_u, m_d)$. As shown in the definition of the E-EDM
machine in Section VI, we add to each segment at most two
states to ensure a regret smaller than $R_d$ for sequences that rotate the
E-EDM machine in a minimal circle that traverses between
segments. Note that there are at most f:^:^! segments. 
Let us examine Equation (76):



k < + 2 

= + 2 



E 



|A(m„,7rad)| 
A(mu,md) 



E M(~^2to„to, 

niu ,mc; £N 



= + 2 + 2i?-i/2 J2 \A{mu,md)\ 

r 1 — x—\/R -\ r x—\/R -\ 

• (x(l -x) + Vr + R) 



in the lower and upper halves is equal, we get: 

k < +3i?"i/2) +2+ 



By denoting the segments with the same maximum up-step as
$B(m_u)$, we can further bound the number of states:



k<l{R-^ +m~^'^) + 2 + \R-'"^ |B(to, 



(78) 



max x{\ — x) 



Since $|B(m_u)| = 2\sqrt{R}$ for almost all $m_u$ ($|B(m_u)| < 2\sqrt{R}$ at
the edges of the interval $[0, 1]$), $x(1 - x)$ is a concave function
with a single maximum point at $\frac{1}{2}$ and the number of states



^-3/2 2\^.{y/R + i2y/R){l- {y/R + i2y/R)) 



< Yji?"^/^ - ^R^^ - 12i?-i/2 _ 32 

_ 2-'/" n-3/2 n(U-^\ 

where we applied R = 



(79) 



We can also bound the number of states from below by: 



k > 



E 



ld)i 



E 



\A{mu,md)\ ■ {x{l - x)- 



(80) 



R + R), 

By denoting the segments with the same maximum up-step as
$B(m_u)$, we can bound the number of states from below:

fc> \ {- R^^ +R^^'^ + 

+ R-^/^ V |B(to„)|- min x(l - x)) . 

(81) 

Using the approximation we made to calculate the lower bound 
we get: 

> - 15i?"^ + 2R-^I'^) 



12 V 2 



-3/2 



0{R-, 



(82) 



(77) 



Thus, we upper and lower bounded the number of states in 
the E-EDM machine by ( ^ ) ~3/2 + 0(7?-!). ■ 